Important Note

This script and its corresponding output use simulated data based on the original dataset I downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Part of the data agreement for ADNI users is that data not be published externally, so I cannot directly share the data on which I ran my own analysis. However, the simulated data structure (columns, number of rows, subject IDs, etc.) is identical to that downloaded from ADNI; the only difference is that the tau-PET uptake values, age, sex, and cognitive assessment scores are simulated. Any reader who wishes to access the actual dataset used for my analysis should register for a (free) ADNI account and refer to the specific data files described in Data Understanding.

Additionally, this project has been published as a GitHub Pages website containing figures and analysis created with the actual ADNI dataset, which can be found here: https://anniegbryant.github.io/DA5030_Final_Project/

Lastly, the Shiny app deployed based on this project can be found here: https://annie-bryant.shinyapps.io/TauPET_Shiny_App_Notebook/

   

Phase One: Business Understanding

Determining Business Objectives

For my final project for DA5030 “Data Mining and Machine Learning”, my objective is to leverage neuroimaging-based data to predict cognitive decline in subjects along the cognitive spectrum, from cognitively unimpaired to severe dementia. The goal is to identify specific brain regions that, when burdened by Alzheimer’s Disease-related pathology, are predictive of cognitive status as measured via neuropsychological assessment. Ideally, I would like to identify the regions of interest (ROIs) in the brain that change the most with declining cognitive ability and to refine a set of ROIs that collectively predict changes in cognitive assessment scores. This will be (tentatively) regarded as a success if one or more ROIs can explain more than 50% of the variance in cognitive assessment scores (i.e. R\(^2\) > 0.5).

Assessing the Situation

I will focus on one specific form of neuroimaging: Positron Emission Tomography (PET). PET imaging enables the visualization of specific molecular substrates in the brain through the use of radioactively-labeled tracers that bind the target substrate. In this case, I have chosen to focus on PET tracers that bind the protein tau, which exhibits characteristic misfolding in Alzheimer’s Disease (AD). Misfolded tau not only loses its normal function, but also aggregates into intracellular neurofibrillary tangles (NFTs) that can disrupt neuronal signaling and promote neurodegeneration. This pathology typically follows an archetypal spreading pattern: it begins in the entorhinal cortex, progresses out to the hippocampus and amygdala, and then spreads beyond the medial temporal lobe to the limbic system and on to the neocortex. This staging pattern was defined in the seminal paper published by Braak & Braak in 1991, and the stages of tau NFT pathology progression are now known as the Braak stages; there are six stages in total.
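As a reference for later analysis, the ADNI tau-PET file summarizes these six stages as three composite ROIs (BRAAK12, BRAAK34, BRAAK56). A minimal R sketch of this lookup, where the region groupings are my own shorthand summary of the spreading pattern described above, not exact ADNI column definitions:

```r
# Simplified lookup pairing composite Braak stages with their canonical regions.
# Groupings are an illustrative summary of Braak & Braak (1991), not ADNI definitions.
braak_regions <- list(
  BRAAK12 = c("entorhinal cortex"),
  BRAAK34 = c("hippocampus", "amygdala", "limbic regions"),
  BRAAK56 = c("neocortex")
)

# Example: which composite stage first involves the hippocampus?
names(Filter(function(regions) "hippocampus" %in% regions, braak_regions))
# "BRAAK34"
```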

Such staging has traditionally only been possible at autopsy, as it requires careful immunohistochemical staining of several brain regions by an experienced neuropathologist. However, recent years have seen the development of tau-PET tracers that are specific to misfolded NFT tau. One tracer in particular, 18F-AV-1451, has become widely used as a non-invasive biomarker to measure regional accumulation of tau in the human brain. Tau-PET uptake correlates well with the typical postmortem Braak staging patterns (Schwarz et al. 2016) as well as with cognitive status (Zhao et al. 2019). Recent studies have applied machine learning algorithms to tau-PET neuroimaging, along with other (relatively) non-invasive biomarkers including amyloid-beta PET and cerebrospinal fluid (CSF) protein measurements, to predict the onset of dementia (Mishra et al. 2017) or the spread of tau NFT pathology in the brain (Vogel et al. 2019, 2020). However, longitudinal analysis of tau-PET accumulation and its relationship to cognition remains relatively unexplored, largely owing to the recency of tau-PET tracer development.

Resource Inventory

Through my role as a research assistant at the MassGeneral Institute for Neurodegenerative Disease, I have worked with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data repository previously. ADNI is a tremendous resource for imaging-based and molecular biomarker data acquired from thousands of research participants across the country (see Acknowledgments for more information). In 2016, ADNI incorporated 18F-AV-1451 tau-PET neuroimaging into its imaging protocol, and has since amassed well over a thousand tau-PET scans. Researchers at UC Berkeley have processed many of these images, quantified regional uptake of the tau-PET tracer, and generously shared their regional tau-PET data for ADNI collaborators to access. ADNI has also compiled cognitive assessment scores for each subject. I will utilize these two resources to develop individual regression models as well as an ensemble model to predict cognitive decline as a function of pathological tau NFT accumulation throughout the brain.

Requirements, Assumptions, Constraints

The only constraint is that I cannot directly share the full dataset as downloaded from ADNI, though I encourage anyone interested in gaining access to register for free at http://adni.loni.usc.edu/. Instead, I will use the R library fakeR (vignette) to simulate the two datasets I will access from ADNI to publish in my GitHub repository, so that the interested reader can follow along with consistent data structures.

Terminology

Determine Data Mining Goals

My goal in this analysis is to develop a model that can predict change in cognitive status through some combination (linear or nonlinear) of multiple brain regions, each of which exhibits a different change in tau-PET uptake. In doing so, I also hope to identify which region(s) of the brain are most prone to accumulation of tau NFT pathology as measured via PET, and in turn, which region(s) best predict cognitive decline.

Data Mining Success Criteria

The target feature in this project will be a continuous measurement representing the score on a cognitive assessment (CDR Sum of Boxes – see Data Understanding). Therefore, models will be evaluated based on their root mean squared error (RMSE) and the R\(^2\) between predicted and actual cognitive scores. I have set a benchmark of success at R\(^2\) > 0.5, meaning the model explains at least 50% of the variance seen in cognitive score changes. This is an ambitious threshold, as cognitive status is multifactorial and certainly modulated by more than regional tau accumulation, but this figure will distinguish stronger from weaker predictive models.
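These two metrics can be computed directly from predicted and actual scores; a minimal sketch with toy vectors (values are hypothetical, not ADNI data):

```r
# Evaluation metrics for predicted vs. actual CDR Sum of Boxes scores
# (toy vectors for illustration only).
actual    <- c(0, 0.5, 1, 2, 4.5, 6)
predicted <- c(0.2, 0.4, 1.5, 1.8, 4.0, 6.3)

rmse <- sqrt(mean((actual - predicted)^2))
r2   <- cor(actual, predicted)^2  # squared Pearson correlation

# The benchmark for a "successful" model in this project:
r2 > 0.5
```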

Phase Two: Data Understanding

All packages used in this file:

# General data wrangling
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(DT)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readxl)
library(fakeR)

# Modeling
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0-2
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ranger)
library(caretEnsemble)
## 
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
## 
##     autoplot
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
# Visualization
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(forcats)
library(ggsignif)
library(ggcorrplot)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(ggplotify)
library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
## 
##     groups
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# ggseg is used to visualize the brain
# remotes::install_github("LCBC-UiO/ggseg")
# If that doesn't work: 
# download.file("https://github.com/LCBC-UiO/ggseg/archive/master.zip", "ggseg.zip")
# unzip("ggseg.zip")
# devtools::install_local("ggseg-master")
library(ggseg)

# remotes::install_github("LCBC-UiO/ggseg3d")
library(ggseg3d)

# remotes::install_github("LCBC-UiO/ggsegExtra")
library(ggsegExtra)

Tau-PET Data

Data overview

The longitudinal tau-PET dataset was downloaded as a CSV from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository, located at Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 241 columns. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

Data loading

On my end, I load partial volume corrected regional tau-PET data, as downloaded from ADNI:

tau.df <- read.csv("../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%m/%d/%Y")

# update stamp is irrelevant, drop it
tau.df <- select(tau.df, -update_stamp)

However, since I can’t share the tau-PET data directly from ADNI, I’ve simulated this dataset using the fakeR library with the following code:

set.seed(129)
# use simulate_dataset from fakeR
# stealth.level=2 means each column is simulated independently
tau.sim <- simulate_dataset(tau.df, digits=5, stealth.level=2, 
                            ignore=c("RID", "VISCODE", "VISCODE2", "EXAMDATE"))
write.csv(tau.sim, "Simulated_ADNI_TauPET.csv", row.names = F)

The simulated tau-PET dataset can be loaded as follows:

tau.df <- read.csv("Simulated_ADNI_TauPET.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%Y-%m-%d")

Each row in the CSV represents one tau-PET scan (see the str call below). Some subjects had repeated scans separated by approximately one year, while others had only one scan. Columns include subject information such as anonymized subject ID, visit code, and PET exam date. The remaining columns encode regional volume and tau-PET uptake: there are 123 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, the Standardized Uptake Value Ratio (SUVR).
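This naming convention means the SUVR and VOLUME columns should pair up one-to-one per ROI, which is easy to sanity-check. A sketch on a toy data frame that mimics the ADNI column names (the same pattern would apply to the full tau.df):

```r
# Sanity check that SUVR and VOLUME columns pair up one-to-one per ROI,
# shown on a toy data frame with the same naming convention as the ADNI file.
toy <- data.frame(RID = 1,
                  CTX_LH_ENTORHINAL_SUVR = 1.5, CTX_LH_ENTORHINAL_VOLUME = 2500,
                  LEFT_HIPPOCAMPUS_SUVR  = 1.3, LEFT_HIPPOCAMPUS_VOLUME  = 3000)

suvr_rois   <- sub("_SUVR$",   "", grep("_SUVR$",   names(toy), value = TRUE))
volume_rois <- sub("_VOLUME$", "", grep("_VOLUME$", names(toy), value = TRUE))

# character(0) means every SUVR region has a matching volume column
setdiff(suvr_rois, volume_rois)
```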

str(tau.df)
## 'data.frame':    1120 obs. of  164 variables:
##  $ RID                                   : int  21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE                               : chr  "init" "init" "y1" "init" ...
##  $ VISCODE2                              : chr  "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                              : Date, format: "2018-02-02" "2018-04-24" ...
##  $ INFERIOR_CEREBGM_SUVR                 : num  0.342 0.452 0.129 0.145 2.957 ...
##  $ INFERIOR_CEREBGM_VOLUME               : num  56383 65081 56900 62775 52594 ...
##  $ HEMIWM_SUVR                           : num  0.954 0.966 1.004 1.081 1.119 ...
##  $ HEMIWM_VOLUME                         : num  483990 403653 429393 317379 408757 ...
##  $ BRAAK12_SUVR                          : num  2.09 1.46 1.54 1.93 1.35 ...
##  $ BRAAK12_VOLUME                        : num  10119 8682 13232 10611 12208 ...
##  $ BRAAK34_SUVR                          : num  1.64 2.72 1.93 1.47 1.8 ...
##  $ BRAAK34_VOLUME                        : num  116705 97847 106122 122590 103573 ...
##  $ BRAAK56_SUVR                          : num  2.39 1.99 1.84 1.65 2.38 ...
##  $ BRAAK56_VOLUME                        : num  298076 339962 341138 334962 315329 ...
##  $ BRAIN_STEM_SUVR                       : num  1.17 1.23 1.14 1.22 1.18 ...
##  $ BRAIN_STEM_VOLUME                     : num  20725 23551 14970 22162 22223 ...
##  $ LEFT_MIDDLEFR_SUVR                    : num  2.156 1.426 1.938 0.853 1.316 ...
##  $ LEFT_MIDDLEFR_VOLUME                  : num  20625 22012 19429 16616 21708 ...
##  $ LEFT_ORBITOFR_SUVR                    : num  2.02 1.65 2.55 1.52 2.15 ...
##  $ LEFT_ORBITOFR_VOLUME                  : num  11753 12113 11649 12066 11627 ...
##  $ LEFT_PARSFR_SUVR                      : num  1.87 2 1.89 1.71 2.23 ...
##  $ LEFT_PARSFR_VOLUME                    : num  9990 10503 10211 10983 7135 ...
##  $ LEFT_ACCUMBENS_AREA_SUVR              : num  1.91 2.04 1.5 3.22 1.5 ...
##  $ LEFT_ACCUMBENS_AREA_VOLUME            : num  355 325 831 418 691 ...
##  $ LEFT_AMYGDALA_SUVR                    : num  0.923 2.092 1.41 0.789 1.76 ...
##  $ LEFT_AMYGDALA_VOLUME                  : num  987 1284 841 1708 1639 ...
##  $ LEFT_CAUDATE_SUVR                     : num  1.36 1.35 1.99 1.31 1.16 ...
##  $ LEFT_CAUDATE_VOLUME                   : num  4226 3156 3132 4798 3438 ...
##  $ LEFT_HIPPOCAMPUS_SUVR                 : num  1.36 1.57 1.28 1.94 2.22 ...
##  $ LEFT_HIPPOCAMPUS_VOLUME               : num  2644 4958 2849 2426 2993 ...
##  $ LEFT_PALLIDUM_SUVR                    : num  3.05 1.91 2.2 2.12 2.12 ...
##  $ LEFT_PALLIDUM_VOLUME                  : num  1485 1354 1360 1446 1400 ...
##  $ LEFT_PUTAMEN_SUVR                     : num  1.81 2.07 1.07 1.91 1.73 ...
##  $ LEFT_PUTAMEN_VOLUME                   : num  6113 3982 6187 6479 4979 ...
##  $ LEFT_THALAMUS_PROPER_SUVR             : num  1.44 1.16 1.3 1.08 1.19 ...
##  $ LEFT_THALAMUS_PROPER_VOLUME           : num  4916 7297 6316 7172 6165 ...
##  $ RIGHT_MIDDLEFR_SUVR                   : num  2.16 1.74 2.41 1.41 1.56 ...
##  $ RIGHT_MIDDLEFR_VOLUME                 : num  23643 17915 21423 19796 22511 ...
##  $ RIGHT_ORBITOFR_SUVR                   : num  2.45 2.53 1.69 2.17 1.98 ...
##  $ RIGHT_ORBITOFR_VOLUME                 : num  12020 11800 12455 12198 11621 ...
##  $ RIGHT_PARSFR_SUVR                     : num  2.3 1.71 2.12 2.41 1.44 ...
##  $ RIGHT_PARSFR_VOLUME                   : num  7173 10629 10659 9281 8880 ...
##  $ RIGHT_ACCUMBENS_AREA_SUVR             : num  1.08 1.49 1.5 1.85 1.74 ...
##  $ RIGHT_ACCUMBENS_AREA_VOLUME           : num  452 424 375 550 566 ...
##  $ RIGHT_AMYGDALA_SUVR                   : num  0.359 0.987 1.605 1.883 2.486 ...
##  $ RIGHT_AMYGDALA_VOLUME                 : num  1462 1637 781 1317 1882 ...
##  $ RIGHT_CAUDATE_SUVR                    : num  1.66 1.15 1.83 1.86 1.39 ...
##  $ RIGHT_CAUDATE_VOLUME                  : num  4361 4264 3728 4369 4000 ...
##  $ RIGHT_HIPPOCAMPUS_SUVR                : num  1.48 1.99 1.06 1.43 2.05 ...
##  $ RIGHT_HIPPOCAMPUS_VOLUME              : num  3550 4934 3266 4490 3745 ...
##  $ RIGHT_PALLIDUM_SUVR                   : num  1.93 2.09 1.72 2.18 2.22 ...
##  $ RIGHT_PALLIDUM_VOLUME                 : num  1730 1335 1409 899 1502 ...
##  $ RIGHT_PUTAMEN_SUVR                    : num  1.83 1.61 1.31 1.49 1.33 ...
##  $ RIGHT_PUTAMEN_VOLUME                  : num  5996 6245 5846 3153 5146 ...
##  $ RIGHT_THALAMUS_PROPER_SUVR            : num  1.23 1.18 1.44 1.17 1.64 ...
##  $ RIGHT_THALAMUS_PROPER_VOLUME          : num  7278 5221 6663 7635 6595 ...
##  $ CHOROID_SUVR                          : num  4.03 2.26 3.23 3.61 1.44 ...
##  $ CHOROID_VOLUME                        : num  3250 3629 5913 4229 4444 ...
##  $ CTX_LH_BANKSSTS_SUVR                  : num  1.42 1.9 1.14 2.9 3.7 ...
##  $ CTX_LH_BANKSSTS_VOLUME                : num  2397 1945 2031 2778 1494 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR   : num  1.32 1.87 1.63 1.62 2.04 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : num  1056 1447 1922 2292 830 ...
##  $ CTX_LH_CUNEUS_SUVR                    : num  1.87 1.4 1.33 2.85 2.03 ...
##  $ CTX_LH_CUNEUS_VOLUME                  : num  3024 2512 2730 3196 2642 ...
##  $ CTX_LH_ENTORHINAL_SUVR                : num  1.48 0.737 2.628 1.373 2.87 ...
##  $ CTX_LH_ENTORHINAL_VOLUME              : num  2536 2632 725 2091 1424 ...
##  $ CTX_LH_FUSIFORM_SUVR                  : num  1.74 2.19 1.41 3.37 2.63 ...
##  $ CTX_LH_FUSIFORM_VOLUME                : num  7062 8214 8871 7792 9423 ...
##  $ CTX_LH_INFERIORPARIETAL_SUVR          : num  1.504 2.807 2.974 2.104 0.895 ...
##  $ CTX_LH_INFERIORPARIETAL_VOLUME        : num  9906 11466 11637 13180 9685 ...
##  $ CTX_LH_INFERIORTEMPORAL_SUVR          : num  1.19 2.77 2.24 1.26 3.73 ...
##  $ CTX_LH_INFERIORTEMPORAL_VOLUME        : num  9919 9647 10694 7474 6432 ...
##  $ CTX_LH_INSULA_SUVR                    : num  1.95 1.55 1.8 1.5 1.55 ...
##  $ CTX_LH_INSULA_VOLUME                  : num  5793 7211 5311 7230 7132 ...
##  $ CTX_LH_ISTHMUSCINGULATE_SUVR          : num  2.32 2.54 1.66 2.36 1.05 ...
##  $ CTX_LH_ISTHMUSCINGULATE_VOLUME        : num  1907 2075 2882 2013 2545 ...
##  $ CTX_LH_LATERALOCCIPITAL_SUVR          : num  2.46 2.75 2.49 2.41 3.66 ...
##  $ CTX_LH_LATERALOCCIPITAL_VOLUME        : num  7550 10710 10421 7259 11461 ...
##  $ CTX_LH_LINGUAL_SUVR                   : num  2.11 2.5 1.51 2.34 2.18 ...
##  $ CTX_LH_LINGUAL_VOLUME                 : num  6765 4697 6816 6056 4720 ...
##  $ CTX_LH_MIDDLETEMPORAL_SUVR            : num  1.875 2.573 1.462 0.896 2.273 ...
##  $ CTX_LH_MIDDLETEMPORAL_VOLUME          : num  9657 10869 6444 9428 10033 ...
##  $ CTX_LH_PARACENTRAL_SUVR               : num  1.37 2.08 1.77 1.62 1.83 ...
##  $ CTX_LH_PARACENTRAL_VOLUME             : num  3004 2869 2835 2917 3642 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_SUVR           : num  2 1.53 2.4 2.19 1.65 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_VOLUME         : num  1704 3116 1873 1627 2168 ...
##  $ CTX_LH_PERICALCARINE_SUVR             : num  1.253 1.695 1.557 0.918 1.628 ...
##  $ CTX_LH_PERICALCARINE_VOLUME           : num  1878 1491 1933 1844 1980 ...
##  $ CTX_LH_POSTCENTRAL_SUVR               : num  2.12 1.57 2.08 1.9 2.09 ...
##  $ CTX_LH_POSTCENTRAL_VOLUME             : num  8950 6142 9305 8798 8728 ...
##  $ CTX_LH_POSTERIORCINGULATE_SUVR        : num  1.64 1.1 1.32 1.75 1.92 ...
##  $ CTX_LH_POSTERIORCINGULATE_VOLUME      : num  3152 2631 2581 1706 3536 ...
##  $ CTX_LH_PRECENTRAL_SUVR                : num  1.77 2.18 1.6 1.33 1.84 ...
##  $ CTX_LH_PRECENTRAL_VOLUME              : num  10643 13116 13763 9665 12526 ...
##  $ CTX_LH_PRECUNEUS_SUVR                 : num  1.909 0.468 0.583 1.627 3.145 ...
##  $ CTX_LH_PRECUNEUS_VOLUME               : num  8184 7041 7739 9173 8416 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR  : num  1.58 1.7 1.81 1.35 2.19 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: num  2737 2240 2087 2341 2199 ...
##  $ CTX_LH_SUPERIORFRONTAL_SUVR           : num  1.448 1.731 1.1 2.507 0.999 ...
##   [list output truncated]

The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in blue below), a commonly-used region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.

aseg_3d %>% 
  unnest(ggseg_3d) %>% 
  ungroup() %>% 
  select(region) %>% 
  na.omit() %>% 
  mutate(val=ifelse(region %in% c("Right-Cerebellum-Cortex", "Left-Cerebellum-Cortex"), 1, 0)) %>%
  ggseg3d(atlas=aseg_3d, label="region", text="val", colour="val", na.alpha=0.5, 
           palette=c("transparent", "deepskyblue3"), show.legend=F) %>%
  add_glassbrain() %>%
  pan_camera("left lateral") %>%
  remove_axes()
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

All cortical and subcortical ROIs were delineated by first co-registering the tau-PET image to a high-resolution structural T1-weighted MPRAGE acquired in the same imaging session, and then applying FreeSurfer (v5.3) for automated regional segmentation and parcellation. The following diagram from Marcoux et al. (2018) summarizes this process:

Furthermore, to mitigate the relatively low voxel resolution of PET imaging, partial volume correction was applied, using probabilistic tissue segmentation maps to refine individual ROIs. Note: these PET processing steps were all performed by Susan Landau, Deniz Korman, and William Jagust at the Helen Wills Neuroscience Institute, UC Berkeley and Lawrence Berkeley National Laboratory.

18F-AV-1451 is a relatively recent PET tracer, and was only incorporated into the ADNI-3 pipeline beginning in 2016. I am curious about the temporal distribution of the FreeSurfer-analyzed scans here:

tau.df %>%
  # RID = unique subject identifier, EXAMDATE=date of PET scan
  select(RID, EXAMDATE) %>%
  # Create interactive plotly histogram, which auto-formats date along x-axis
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Tau-PET Scan Date Distribution',
         xaxis = list(title = 'Scan Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of PET Scans')) 

Even though ADNI3 officially began in 2016, most scans were acquired from mid-2017 onward. This doesn’t affect the analysis, though, since the same tau-PET imaging protocol has been used across sites since 2016.

Data distribution

Since this project will explore tau-PET measurements over time, I will be refining the dataset to only subjects with multiple tau-PET scans. Here’s the overall distribution of number of longitudinal scans by subject:

# Plot number of PET scans by subject in a bar chart
p_num_long <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_scans, .desc=T), y=n_scans)) +
  geom_bar(stat="identity", aes(fill=n_scans, color=n_scans)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of Longitudinal PET Scans per Subject") +
  ylab("Number of PET Scans") +
  xlab("Subject") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 
## `summarise()` ungrouping output (override with `.groups` argument)
# Convert to interactive plotly chart
ggplotly(p_num_long)
rm(p_num_long)

The majority of subjects only had one tau-PET scan; given the longitudinal nature of this project, such subjects will eventually be omitted from analysis. On that note, it’s important to know exactly how many subjects do have at least two tau-PET scans:

# Calculate number of subjects with 2+ PET scans
num_scans <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  # Tabulate # scans by subject and filter
  summarise(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_scans=sum(n_scans))
## `summarise()` ungrouping output (override with `.groups` argument)
# Print results
cat("Number of subjects with at least two scans: **", 
    num_scans$num_subjects, "**\n", 
    "\nNumber of total PET scans: **", 
    num_scans$total_scans, "**\n", sep="")

Number of subjects with at least two scans: 249

Number of total PET scans: 593

So, we have 249 subjects with two or more scans.

Temporal distribution

Another important consideration is the length of time between each consecutive scan. I will eventually normalize changes in tau-PET to number of years passed to yield an annual rate of change, but it’s good to know what that time interval is:

# Plot the # years in between each pair of consecutive tau-PET scans
p_pet_interval <- tau.df %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  # Calculate difference between pairs of scan dates using lag
  mutate(Years_between_Scans = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit the first scan, for which the time interval is zero
  filter(Years_between_Scans>0) %>%
  ggplot(., aes(x=Years_between_Scans)) +
  geom_histogram(stat="count", color="lightslategray") +
  ggtitle("Years in between Tau-PET Scans per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive scans for a subject") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad
# Convert to interactive plotly histogram
ggplotly(p_pet_interval)
rm(p_pet_interval)

There’s a cluster of scans around the one-year mark. Presumably, ADNI3 participants are scheduled for roughly annual tau-PET scans, though in some cases the scans aren’t at precise yearly intervals.
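The planned annualization divides each per-subject change in SUVR by the years elapsed between consecutive scans. A minimal sketch on toy data, reusing the lag() pattern from the interval plot above (the SUVR column and values are illustrative):

```r
# Sketch of the planned annualization: per-subject change in SUVR divided by
# years elapsed between consecutive scans (toy data for illustration).
library(dplyr)

toy <- tibble::tibble(
  RID      = c(1, 1, 2, 2),
  EXAMDATE = as.Date(c("2017-06-01", "2018-06-15", "2017-09-01", "2018-08-20")),
  SUVR     = c(1.20, 1.32, 1.50, 1.48))

annual_rates <- toy %>%
  group_by(RID) %>%
  arrange(EXAMDATE, .by_group = TRUE) %>%
  # Date subtraction yields days; divide by 365 as in the interval plot
  mutate(years         = as.numeric(EXAMDATE - lag(EXAMDATE)) / 365,
         annual_change = (SUVR - lag(SUVR)) / years) %>%
  # Drop each subject's first scan, which has no preceding interval
  filter(!is.na(annual_change)) %>%
  ungroup()

annual_rates
```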

Data missingness

I’ll check if there are any missing data points for tau-PET SUVR values for any of the FreeSurfer-derived cortical or subcortical regions of interest (ROIs). Note: this is filtered to show only subjects with at least two scans:

# Calculate number and proportion of missing data records for each measured region of interest (ROI)
tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  ungroup() %>%
  # filter only to the SUVR columns
  select(!matches("VOLUME")) %>%
  # Reshape from wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  group_by(ROI) %>%
  # Calculate number and proportion of missing data points for each ROI
  summarise(`Percent Missing` = sum(is.na(SUVR))/n(),
            `Number Missing` = sum(is.na(SUVR))) %>%
  datatable()
## `summarise()` ungrouping output (override with `.groups` argument)

All regions have zero missing data points, so no imputation will be necessary.

SUVR Distribution by Region

Now, I’ll check the distribution of tau-PET uptake values across the ROIs.

# plot the mean tau-PET SUVR for each region of interest
p_roi_suvr <- tau.df %>%
  # omit irrelevant variables
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # only look at SUVR columns; reshape wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = tolower(str_replace(ROI, "_SUVR", ""))) %>%
  group_by(ROI) %>%
  # calculate mean and SD, along with ymin and ymax, to plot error bars for each ROI
  summarise(Mean_SUVR=mean(SUVR, na.rm=T),
            SD_SUVR = sd(SUVR, na.rm=T),
            ymin = Mean_SUVR-SD_SUVR,
            ymax = Mean_SUVR+SD_SUVR) %>%
  # fct_reorder --> arrange ROI by its average tau-PET SUVR
  ggplot(data=., mapping=aes(x=fct_reorder(ROI, Mean_SUVR, .desc=F), 
                             y=Mean_SUVR,
                             label = ROI)) +
  geom_bar(stat="identity", show.legend=F, fill="lightsteelblue") +
  # Add error bars
  geom_errorbar(aes(ymin=ymin, ymax=ymax), width=0, color="lightslategray") +
  coord_flip() +
  theme_minimal() +
  ylab("Mean Tau-PET SUVR") +
  xlab("Region of Interest") +
  ggtitle("Mean Tau-PET SUVR by ROI") +
  theme(axis.text.x=element_text(size=8, angle=45))
## `summarise()` ungrouping output (override with `.groups` argument)
# Convert to interactive plotly graph
ggplotly(p_roi_suvr, height=1000, width=600, tooltip=c("label", "y"))
rm(p_roi_suvr)

ROI Normalization

These values are supposed to be normalized to the inferior cerebellum gray matter, indicated by INFERIOR_CEREBGM_SUVR. To confirm, I’ll check the distribution of INFERIOR_CEREBGM_SUVR values.

# Plot distribution of inferior cerebellum gray tau-PET SUVR
p_inf_cb <- tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # Select only SUVR columns for ROIs; pivot wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  # Filter to inferior cerebellum gray
  filter(ROI=="INFERIOR_CEREBGM") %>%
  ggplot(data=., mapping=aes(x=SUVR)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("Number of Occurrences") +
  xlab("Inferior Cerebellum Gray SUVR") +
  ggtitle("Distribution of Inferior Cerebellum Gray Matter Tau Uptake") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive plotly histogram
ggplotly(p_inf_cb)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
rm(p_inf_cb)

Most of the inferior cerebellum gray SUVR values actually center around 1.25 rather than exactly 1, so the data are not normalized as expected. I’ll re-normalize all ROI values to the mean of this distribution in Data Preparation.
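The planned correction amounts to dividing every SUVR column by the mean of this reference distribution, so that the reference region is re-centered at 1. A minimal sketch on toy values (the real re-normalization happens in Data Preparation):

```r
# Divide all SUVR columns by the mean inferior cerebellar SUVR so the
# reference region is re-centered at 1 (toy values for illustration).
suvr_toy <- data.frame(INFERIOR_CEREBGM_SUVR  = c(1.20, 1.30),
                       CTX_LH_ENTORHINAL_SUVR = c(1.80, 2.60))

ref_mean    <- mean(suvr_toy$INFERIOR_CEREBGM_SUVR)  # 1.25 here
suvr_renorm <- suvr_toy / ref_mean

colMeans(suvr_renorm)  # reference column now averages exactly 1
```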

Subject demographics + cognitive assessments

Data overview

The longitudinal subject demographic and cognitive assessment dataset was downloaded as a CSV from the ADNI Study Data repository, located at Study Data/Study Info/Key ADNI tables merged into one table. This CSV file contains 14,816 rows and 113 columns. This includes many subjects with multiple follow-up visits. Columns include (anonymized) subject demographic information such as age, sex, race, and marital status. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

This project will use one key target feature from this dataset: the Clinical Dementia Rating (CDR) Sum of Boxes. The CDR scale was initially developed in 1982 at Washington University as a metric of clinical dementia progression (Hughes et al., 1982). This cognitive test measures impairment in six cognitive domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Morris 1991). Each of these domains is scored on a scale from 0 to 3 as follows:

  • 0 = no impairment
  • 0.5 = questionable
  • 1 = mild dementia
  • 2 = moderate dementia
  • 3 = severe dementia

The global score is most heavily influenced by the memory score, though there is an established decision tree-like algorithm for calculating the global score. By contrast, the CDR Sum of Boxes is the sum of the scores across the six domains, with an overall range of 0 (no impairment) to 18 (severe impairment). The CDR Sum of Boxes thus extends the CDR global score, offering a more granular quantitative measure. This metric reportedly improves precision in longitudinal cognitive tracking, particularly in cases of mild dementia (O’Bryant et al., 2008). Of note, the CDR assessment shows high inter-rater reliability, which is important given the multi-site nature of the ADNI cohort (Morris 1991).
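As a concrete illustration of the scoring scheme (hypothetical domain scores, not ADNI data), the Sum of Boxes is simply the arithmetic sum of the six domain scores:

```r
# Hypothetical subject: CDR-SB is the sum of the six domain ("box") scores
domain.scores <- c(memory=1, orientation=0.5, judgment=1,
                   community=0.5, home.hobbies=1, personal.care=0)
cdr.sb <- sum(domain.scores)  # 4 on the 0-18 scale
```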

Sources:

  • Hughes, C. P., Berg, L., Danziger, W., Coben, L. A., & Martin, R. L. (1982). A new clinical scale for the staging of dementia. The British journal of psychiatry, 140(6), 566-572.
  • Morris, J. C. (1991). The clinical dementia rating (CDR): Current version and scoring rules. Neurology, 43(11), 1588-1592.
  • O’Bryant, S. E., Waring, S. C., Cullum, C. M., Hall, J., Lacritz, L., Massman, P. J., … & Doody, R. (2008). Staging dementia using Clinical Dementia Rating Scale Sum of Boxes scores: a Texas Alzheimer’s research consortium study. Archives of neurology, 65(8), 1091-1095.

Data loading

ADNI compiled a merged dataset containing key information from several tables, including subject demographics, selected cognitive assessment scores, and select biomarker data.

# Read in subject demographic dataset from ADNI
subj.info <- read.csv("../ADNI_Data/Raw_Data/ADNIMERGE.csv", stringsAsFactors = T, na.strings="")

Note: as with the tau-PET data, I will simulate the data I downloaded from ADNI so that interested readers can follow along, using the “Simulated_ADNI_cognitive_scores.csv” file in the GitHub repo main directory:

set.seed(127)
# subset data to be simulated:
data.to.sim <- select(subj.info, c(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))

# use simulate_dataset from fakeR
# stealth.level=2 means each column is simulated independently
subj.sim <- simulate_dataset(data.to.sim, digits=6, stealth.level=2, ignore=c("RID", "VISCODE", "EXAMDATE"))
write.csv(subj.sim, "Simulated_ADNI_cognitive_scores.csv", row.names = F)

The simulated cognitive scores dataset can be loaded as follows:

subj.info <- read.csv("Simulated_ADNI_cognitive_scores.csv")

Note: my analysis was run on the real ADNI data, so results discussed in the text may differ from output generated with the simulated data.

The cognitive score dataset columns I will be using for this project include:

  • RID: Participant roster ID, which serves as unique subject identifier
  • VISCODE: Visit code
  • EXAMDATE: Date
  • AGE: Age at visit
  • PTGENDER: Biological sex
  • CDRSB: CDR Sum-of-Boxes score at visit
subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, format="%m/%d/%Y")
summary(subj.info %>% select(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))
##       RID         VISCODE             EXAMDATE               AGE       
##  Min.   :   2   Length:14816       Min.   :2005-09-07   Min.   :44.93  
##  1st Qu.: 685   Class :character   1st Qu.:2008-12-23   1st Qu.:68.69  
##  Median :2074   Mode  :character   Median :2012-06-19   Median :73.41  
##  Mean   :2613                      Mean   :2012-05-02   Mean   :73.44  
##  3rd Qu.:4513                      3rd Qu.:2014-03-31   3rd Qu.:78.17  
##  Max.   :6874                      Max.   :2020-07-27   Max.   :99.03  
##    PTGENDER             CDRSB        
##  Length:14816       Min.   :-8.0267  
##  Class :character   1st Qu.: 0.1905  
##  Mode  :character   Median : 2.0928  
##                     Mean   : 2.1180  
##                     3rd Qu.: 4.0425  
##                     Max.   :13.2301

There is a lot of missing data here; however, this dataset includes many subjects and visit dates that don’t correspond to tau-PET scans and therefore won’t be used in this analysis. Missingness will be re-evaluated once the PET data and subject demographic data are merged in Data Preparation.
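For reference, a per-column NA tally over the fields used in this project can be computed with base R (a sketch; the column names match those described above):

```r
# Tally NAs per column of interest (base R sketch)
colSums(is.na(subj.info[, c("RID", "VISCODE", "EXAMDATE", "AGE", "PTGENDER", "CDRSB")]))
```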

Data distribution

The time distribution of ADNI cognitive assessment data can be visualized:

# plotly histogram of cognitive assessment dates
subj.info %>%
  select(RID, EXAMDATE) %>%
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Subject Demographics Date Distribution',
         xaxis = list(title = 'Visit Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of Subjects')) 

One thing to note is that tau-PET was only incorporated into the ADNI pipeline in late 2015/early 2016, so any demographic information from pre-2016 will not be included in modeling and analysis. Let’s check how many subjects had more than one visit recorded in this dataset:

# visualize distribution of # cognitive assessments per subject
p_subj_long <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  # fct_reorder --> arrange subject on x-axis by number of exams, from large to small
  ggplot(., aes(x=fct_reorder(RID, n_exams, .desc=T), y=n_exams, label=RID)) +
  geom_bar(stat="identity", aes(fill=n_exams, color=n_exams)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of ADNI Visits per Subject") +
  ylab("Number of Visits") +
  xlab("Subjects") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 
## `summarise()` ungrouping output (override with `.groups` argument)
# convert to interactive plotly histogram
ggplotly(p_subj_long, tooltip = c("y"))
rm(p_subj_long)

Unlike with the PET data, most subjects have two or more visits recorded with cognitive and demographic information. The subjects in this dataset have up to 21 CDR-Sum of Boxes scores. To examine precisely how many subjects have at least two visits recorded:

num.subj <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_exams=sum(n_exams))
## `summarise()` ungrouping output (override with `.groups` argument)
cat("Number of subjects with at least two ADNI visits: **", 
    num.subj$num_subjects, "**\n", 
    "\nTotal number of longitudinal ADNI visits recorded: **", 
    num.subj$total_exams, "**\n", sep="")

Number of subjects with at least two ADNI visits: 2027

Total number of longitudinal ADNI visits recorded: 14571

There are 2027 subjects with two or more ADNI visits in this dataset. This should include all subjects and visit dates included in the tau-PET dataset, which will be confirmed upon merging the datasets in the Data Preparation stage.

Temporal distribution

It’s also worth checking the distribution of time interval between ADNI visits in these 2027 subjects with two or more visits:

# Plot # years between consecutive cognitive assessments by subject
p.subj.interval <- subj.info %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  arrange(EXAMDATE) %>%
  # Use lag to calculate time interval between exams
  mutate(Years_between_ADNI = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit first scan per subject, for which the time interval is zero
  filter(Years_between_ADNI>0) %>%
  ggplot(., aes(x=Years_between_ADNI)) +
  geom_bar(stat="count", fill="lightsteelblue", color="lightslategray") +
  ggtitle("Years in between ADNI visits per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive ADNI visits") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 
# Convert to plotly interactive histogram
ggplotly(p.subj.interval)
rm(p.subj.interval)

Interestingly, there is a clear peak around 0.5 (six months) and a smaller peak around 1 (one year), indicating that most visits were spaced between 6 and 12 months apart. However, there is a positive skew to this distribution showing that a portion of subjects went up to five years in between visits. This is not likely to affect my analysis, as most tau-PET subjects had PET scans from 2018 onward, and would therefore have a follow-up interval of two years or less.
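To make the lag()-based interval arithmetic concrete, here is its behavior on a toy vector of three visit dates (dplyr assumed loaded, as elsewhere in this notebook):

```r
d <- as.Date(c("2016-01-01", "2016-07-01", "2017-07-01"))
# The first element defaults to d[1], producing the zero interval that gets filtered out
as.numeric(d - dplyr::lag(d, default=d[1])) / 365  # ~0.0, ~0.5, ~1.0 years
```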

CDR-Sum of Boxes Scores

Now, I’ll look into the distribution of the target variable (CDRSB) and how it relates to other covariates, namely age and sex. These visualizations are filtered to show only those subjects with 2+ assessments.

# Plot CDR-SoB scores distribution across subjects with 2+ assessments
p.cdr.scores <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("# of Occurrences") +
  xlab("CDR-Sum of Boxes") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to plotly interactive histogram
ggplotly(p.cdr.scores)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
rm(p.cdr.scores)

CDR-SoB by Age + Sex

Now, to stratify by age and sex, respectively:

# Violin plot by sex for CDR-SoB
p.cdr.sex.violin <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=PTGENDER, y=CDRSB)) +
  geom_violin(aes(fill=PTGENDER)) +
  theme_minimal() +
  ylab("CDR Sum of Boxes Score") +
  xlab("Biological Sex") +
  geom_signif(map_signif_level = F,
              test="t.test",
              comparisons=list(c("Female", "Male"))) +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# CDR-SoB histogram by sex
p.cdr.sex.hist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count.., fill=PTGENDER)) +
  facet_wrap(PTGENDER ~ ., ncol=1) +
  theme_minimal() +
  ylab("Number of Subjects") +
  xlab("CDR Sum of Boxes Score") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")


# Convert plots to interactive plotly visualizations
p.cdr.sex.violin <- ggplotly(p.cdr.sex.violin)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomSignif() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues
p.cdr.sex.hist <- ggplotly(p.cdr.sex.hist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.cdr.sex.violin, p.cdr.sex.hist, shareX=F, shareY=F,
                titleX=T, titleY=T) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "CDR Sum of Boxes Score", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)))

rm(p.cdr.sex.hist, p.cdr.sex.violin)

The distribution of CDR Sum of Boxes scores looks similar between females and males by eye, but I’ll follow up with a t-test to check:

# t-test for CDR-SoB by sex
t.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
t.test(CDRSB ~ PTGENDER, data=t.test.df)
rm(t.test.df)
## 
##  Welch Two Sample t-test
## 
## data:  CDRSB by PTGENDER
## t = -0.7254, df = 13748, p-value = 0.4682
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1282665  0.0589736
## sample estimates:
## mean in group Female   mean in group Male 
##             2.103838             2.138485

In the actual ADNI dataset, there is in fact a statistically significant (p<0.05) difference in CDR Sum of Boxes scores between males and females, with male subjects exhibiting an average score ~0.15 points higher than female subjects; with the simulated data shown above, the difference is not significant (p≈0.47). This is an important consideration, and I will be sure to include sex as a covariate in prediction models where applicable.

Similarly, I’ll compare CDR Sum of Boxes with age:

# Plot CDR-SoB by age in scatter plot, only for subjects with 2+ cognitive assessments
p.age.cdr <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE, y=CDRSB)) +
  labs(color="Number of Points") +
  xlab("Age at Visit") +
  ylab("CDR Sum of Boxes") +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  geom_point(size=1, alpha=0.2, color="lightslategray", fill="lightslategray") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# Plot histogram of age distribution for subjects with 2+ cognitive assessments
p.age.dist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE)) +
  xlab("Age at Visit") +
  ylab("Number of Occurrences") +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ggtitle("Age at Visit Distribution") +
  theme(plot.title=element_text(hjust=0.5))

p.age.cdr <- ggplotly(p.age.cdr)
p.age.dist <- ggplotly(p.age.dist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.age.cdr, p.age.dist, shareX=F, shareY=F,
                titleX=T, titleY=T, margin = 0.05) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Age at Visit", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "Age at Visit", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         autosize = F, width = 800, height = 500)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(p.age.cdr, p.age.dist)

While there doesn’t appear to be any clear linear association between age at visit and CDR Sum of Boxes score, I’ll use cor.test to calculate the Pearson correlation coefficient and the corresponding p-value based on the correlation coefficient t-statistic:

# Correlation test between age and CDR-SoB for subjects with 2+ cognitive assessments
cor.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
cor.test(cor.test.df$AGE, cor.test.df$CDRSB)
## 
##  Pearson's product-moment correlation
## 
## data:  cor.test.df$AGE and cor.test.df$CDRSB
## t = -0.67886, df = 14569, p-value = 0.4972
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02185931  0.01061397
## sample estimates:
##          cor 
## -0.005624153

Note the discrepancy with the real data here: in the actual ADNI dataset, this correlation is statistically significant (p=8.512e-12) despite a very small effect size (r=0.067), whereas the simulated data shown above yield no significant correlation (p≈0.50). The real-data significance seems to reflect the sheer size of the dataset rather than a strong relationship between age and CDR Sum of Boxes scores. Nevertheless, I will explore the use of age as a covariate in modeling later, as this is a common practice.
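The large-sample effect can be sanity-checked by hand: with ~14,500 observations, even r = 0.067 implies a large t-statistic. This back-of-the-envelope sketch uses the real-data effect size quoted above:

```r
# t-statistic and two-sided p-value implied by Pearson's r at this sample size
r <- 0.067; n <- 14571
t.stat <- r * sqrt((n - 2) / (1 - r^2))  # roughly 8.1
p.val <- 2 * pt(-abs(t.stat), df=n - 2)  # vanishingly small despite the tiny effect
```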

Outlier detection

To better identify outliers based on this multivariate dataset, I will calculate Cook’s distance for each data point once the tau-PET data is merged with the cognitive status data in the Data Preparation stage.
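As a preview of that step, Cook’s distance comes straight out of a fitted linear model via stats::cooks.distance(); a common (though arbitrary) flag threshold is 4/n. This sketch assumes a hypothetical merged data frame merged.df containing CDRSB and the ROI predictors:

```r
# Hypothetical: fit a linear model on the merged data, then flag influential points
fit <- lm(CDRSB ~ ., data=merged.df)
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / length(cooks))  # rule-of-thumb cutoff
```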

Phase Three: Data Preparation

Tau-PET Data

I’ll trim this tau-PET data by omitting irrelevant columns and grouping observations by subject ID:

tau.df <- tau.df %>%
  # Omit irrelevant columns
  select(-VISCODE, -HEMIWM_SUVR, -BRAAK12_SUVR,
         -BRAAK34_SUVR, -BRAAK56_SUVR, -OTHER_SUVR) %>%
  # Don't include volumetric data columns
  select(!matches("VOLUME")) %>%
  group_by(RID) 

# remove _SUVR from column names
colnames(tau.df) <- str_replace_all(colnames(tau.df), "_SUVR", "")
str(tau.df)
## tibble [1,120 x 78] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                            : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE2                       : chr [1:1120] "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                       : Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##  $ INFERIOR_CEREBGM               : num [1:1120] 0.342 0.452 0.129 0.145 2.957 ...
##  $ BRAIN_STEM                     : num [1:1120] 1.17 1.23 1.14 1.22 1.18 ...
##  $ LEFT_MIDDLEFR                  : num [1:1120] 2.156 1.426 1.938 0.853 1.316 ...
##  $ LEFT_ORBITOFR                  : num [1:1120] 2.02 1.65 2.55 1.52 2.15 ...
##  $ LEFT_PARSFR                    : num [1:1120] 1.87 2 1.89 1.71 2.23 ...
##  $ LEFT_ACCUMBENS_AREA            : num [1:1120] 1.91 2.04 1.5 3.22 1.5 ...
##  $ LEFT_AMYGDALA                  : num [1:1120] 0.923 2.092 1.41 0.789 1.76 ...
##  $ LEFT_CAUDATE                   : num [1:1120] 1.36 1.35 1.99 1.31 1.16 ...
##  $ LEFT_HIPPOCAMPUS               : num [1:1120] 1.36 1.57 1.28 1.94 2.22 ...
##  $ LEFT_PALLIDUM                  : num [1:1120] 3.05 1.91 2.2 2.12 2.12 ...
##  $ LEFT_PUTAMEN                   : num [1:1120] 1.81 2.07 1.07 1.91 1.73 ...
##  $ LEFT_THALAMUS_PROPER           : num [1:1120] 1.44 1.16 1.3 1.08 1.19 ...
##  $ RIGHT_MIDDLEFR                 : num [1:1120] 2.16 1.74 2.41 1.41 1.56 ...
##  $ RIGHT_ORBITOFR                 : num [1:1120] 2.45 2.53 1.69 2.17 1.98 ...
##  $ RIGHT_PARSFR                   : num [1:1120] 2.3 1.71 2.12 2.41 1.44 ...
##  $ RIGHT_ACCUMBENS_AREA           : num [1:1120] 1.08 1.49 1.5 1.85 1.74 ...
##  $ RIGHT_AMYGDALA                 : num [1:1120] 0.359 0.987 1.605 1.883 2.486 ...
##  $ RIGHT_CAUDATE                  : num [1:1120] 1.66 1.15 1.83 1.86 1.39 ...
##  $ RIGHT_HIPPOCAMPUS              : num [1:1120] 1.48 1.99 1.06 1.43 2.05 ...
##  $ RIGHT_PALLIDUM                 : num [1:1120] 1.93 2.09 1.72 2.18 2.22 ...
##  $ RIGHT_PUTAMEN                  : num [1:1120] 1.83 1.61 1.31 1.49 1.33 ...
##  $ RIGHT_THALAMUS_PROPER          : num [1:1120] 1.23 1.18 1.44 1.17 1.64 ...
##  $ CHOROID                        : num [1:1120] 4.03 2.26 3.23 3.61 1.44 ...
##  $ CTX_LH_BANKSSTS                : num [1:1120] 1.42 1.9 1.14 2.9 3.7 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE : num [1:1120] 1.32 1.87 1.63 1.62 2.04 ...
##  $ CTX_LH_CUNEUS                  : num [1:1120] 1.87 1.4 1.33 2.85 2.03 ...
##  $ CTX_LH_ENTORHINAL              : num [1:1120] 1.48 0.737 2.628 1.373 2.87 ...
##  $ CTX_LH_FUSIFORM                : num [1:1120] 1.74 2.19 1.41 3.37 2.63 ...
##  $ CTX_LH_INFERIORPARIETAL        : num [1:1120] 1.504 2.807 2.974 2.104 0.895 ...
##  $ CTX_LH_INFERIORTEMPORAL        : num [1:1120] 1.19 2.77 2.24 1.26 3.73 ...
##  $ CTX_LH_INSULA                  : num [1:1120] 1.95 1.55 1.8 1.5 1.55 ...
##  $ CTX_LH_ISTHMUSCINGULATE        : num [1:1120] 2.32 2.54 1.66 2.36 1.05 ...
##  $ CTX_LH_LATERALOCCIPITAL        : num [1:1120] 2.46 2.75 2.49 2.41 3.66 ...
##  $ CTX_LH_LINGUAL                 : num [1:1120] 2.11 2.5 1.51 2.34 2.18 ...
##  $ CTX_LH_MIDDLETEMPORAL          : num [1:1120] 1.875 2.573 1.462 0.896 2.273 ...
##  $ CTX_LH_PARACENTRAL             : num [1:1120] 1.37 2.08 1.77 1.62 1.83 ...
##  $ CTX_LH_PARAHIPPOCAMPAL         : num [1:1120] 2 1.53 2.4 2.19 1.65 ...
##  $ CTX_LH_PERICALCARINE           : num [1:1120] 1.253 1.695 1.557 0.918 1.628 ...
##  $ CTX_LH_POSTCENTRAL             : num [1:1120] 2.12 1.57 2.08 1.9 2.09 ...
##  $ CTX_LH_POSTERIORCINGULATE      : num [1:1120] 1.64 1.1 1.32 1.75 1.92 ...
##  $ CTX_LH_PRECENTRAL              : num [1:1120] 1.77 2.18 1.6 1.33 1.84 ...
##  $ CTX_LH_PRECUNEUS               : num [1:1120] 1.909 0.468 0.583 1.627 3.145 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE: num [1:1120] 1.58 1.7 1.81 1.35 2.19 ...
##  $ CTX_LH_SUPERIORFRONTAL         : num [1:1120] 1.448 1.731 1.1 2.507 0.999 ...
##  $ CTX_LH_SUPERIORPARIETAL        : num [1:1120] 2.52 2.72 2.74 2.33 1.33 ...
##  $ CTX_LH_SUPERIORTEMPORAL        : num [1:1120] 2.19 2.16 2.79 2.09 1.95 ...
##  $ CTX_LH_SUPRAMARGINAL           : num [1:1120] 1.399 0.857 1.937 2.252 1.903 ...
##  $ CTX_LH_TEMPORALPOLE            : num [1:1120] 2.38 2.42 2.38 1.7 2 ...
##  $ CTX_LH_TRANSVERSETEMPORAL      : num [1:1120] 1.21 1.1 1.73 1.48 1.52 ...
##  $ CTX_RH_BANKSSTS                : num [1:1120] 1.3 2.05 2.37 2.86 2.77 ...
##  $ CTX_RH_CAUDALANTERIORCINGULATE : num [1:1120] 1.78 1.62 1.68 1.29 1.66 ...
##  $ CTX_RH_CUNEUS                  : num [1:1120] 1.653 3.179 2.55 0.859 1.552 ...
##  $ CTX_RH_ENTORHINAL              : num [1:1120] 0.968 1.259 1.375 1.722 1.379 ...
##  $ CTX_RH_FUSIFORM                : num [1:1120] 1.74 2.31 2.45 2.24 2.39 ...
##  $ CTX_RH_INFERIORPARIETAL        : num [1:1120] 1.412 3.249 3.106 0.483 2.451 ...
##  $ CTX_RH_INFERIORTEMPORAL        : num [1:1120] 2.02 1.64 1.4 2.74 1.96 ...
##  $ CTX_RH_INSULA                  : num [1:1120] 1.8 1.81 1.43 1.59 1.56 ...
##  $ CTX_RH_ISTHMUSCINGULATE        : num [1:1120] 1.16 1.32 1.55 1.93 2.05 ...
##  $ CTX_RH_LATERALOCCIPITAL        : num [1:1120] 2.61 2.98 3.61 2.57 1.38 ...
##  $ CTX_RH_LINGUAL                 : num [1:1120] 2.15 1.64 1.58 1.89 1.56 ...
##  $ CTX_RH_MIDDLETEMPORAL          : num [1:1120] 2.16 1.66 2.22 1.48 1.85 ...
##  $ CTX_RH_PARACENTRAL             : num [1:1120] 1.33 1.45 1.21 2.14 2.12 ...
##  $ CTX_RH_PARAHIPPOCAMPAL         : num [1:1120] 1.46 0.973 1.891 1.329 1.955 ...
##  $ CTX_RH_PERICALCARINE           : num [1:1120] 1.62 1.9 1.73 1.8 1.62 ...
##  $ CTX_RH_POSTCENTRAL             : num [1:1120] 1.68 1.49 1.55 2.19 1.74 ...
##  $ CTX_RH_POSTERIORCINGULATE      : num [1:1120] 2.27 1.13 1.64 1.64 1.75 ...
##  $ CTX_RH_PRECENTRAL              : num [1:1120] 1.69 1.77 1.81 1.67 1.83 ...
##  $ CTX_RH_PRECUNEUS               : num [1:1120] 1.71 1.1 1.78 1.28 2.39 ...
##  $ CTX_RH_ROSTRALANTERIORCINGULATE: num [1:1120] 1.63 2.38 2.12 1.86 1.86 ...
##  $ CTX_RH_SUPERIORFRONTAL         : num [1:1120] 1.46 2.03 2.16 2.07 2.25 ...
##  $ CTX_RH_SUPERIORPARIETAL        : num [1:1120] 3.14 1.83 2.53 1.72 1.54 ...
##  $ CTX_RH_SUPERIORTEMPORAL        : num [1:1120] 2.1 1.39 1.5 1.86 1.75 ...
##  $ CTX_RH_SUPRAMARGINAL           : num [1:1120] 1.96 2.33 2.31 2.14 2.18 ...
##  $ CTX_RH_TEMPORALPOLE            : num [1:1120] 1.63 1.54 1.9 3.02 1.57 ...
##  $ CTX_RH_TRANSVERSETEMPORAL      : num [1:1120] 1.15 1.22 1.76 1.41 1.18 ...
##  - attr(*, "groups")= tibble [776 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ RID  : int [1:776] 21 31 56 59 69 96 112 120 127 142 ...
##   ..$ .rows: list<int> [1:776] 
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

SUVR normalization

As shown in Data Understanding, the ROIs are not precisely standardized to the inferior cerebellum gray matter SUVR. I will re-standardize each region’s ROI SUVR values here:

# tau.stand = tau-PET dataframe with ROI SUVR values re-standardized to inferior cerebellum gray matter SUVR
tau.stand <- tau.df
# iterate over all ROI columns and standardize to inferior cerebellum gray -- tau.df[4]
for (i in 4:ncol(tau.stand)) {
  tau.stand[i] <- tau.stand[i]/ tau.df[4]
}
rm(tau.df)
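For reference, the loop above has a vectorized equivalent: dividing a data frame by a numeric vector of length nrow() recycles the vector down each column. This sketch must run before the rm(tau.df) call, since it references the un-standardized data:

```r
# Equivalent to the loop: divide every ROI column (4 onward) by inferior cerebellum gray
roi.cols <- 4:ncol(tau.df)
tau.stand <- tau.df
tau.stand[roi.cols] <- tau.stand[roi.cols] / tau.df[[4]]
```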

Standardization can be verified using summary:

summary(tau.stand$INFERIOR_CEREBGM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

ROI selection, a priori

Now that regional SUVR is properly standardized, the next step is to select brain regions based on a priori knowledge of where and how tau affects the brain in MCI/AD. I am going to stratify the cortical parcellations and subcortical segmentations based on Schöll et al. (2016) and per UCSF’s recommendations for usage of their tau-PET data. Here is the stratification across the Braak stages:

# Display the UCSF & Scholl 2016 ROI Braak stage datatable
roi.braak <- read.csv("RData/roi_braak_stages.csv") %>% 
  mutate(ROI_Name = tolower(ROI_Name)) %>%
  mutate(Hemisphere = ifelse(str_detect(ROI_Name, "rh_|right"), "Right", "Left"))
datatable(roi.braak %>% select(-Base_Name))

The following plots show the spatial relationship of the Braak stages in the brain, in the cortical (top) and subcortical (bottom) ROIs:

# Load dataframes that link the ADNI tau-PET ROIs with nomenclature used in the ggseg package
ggseg.aparc <- read_excel("RData/ggseg_roi.xlsx", sheet=1) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))
ggseg.aseg <- read_excel("RData/ggseg_roi.xlsx", sheet=2) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

# Plot the Desikan-Killiany cortical parcellation atlas
p.aparc <-  dk %>% filter(hemi=="right") %>% 
  unnest(ggseg) %>% 
  select(region) %>% 
  na.omit() %>% 
  distinct() %>%
  left_join(., ggseg.aparc, by=c("region"="ggseg_ROI")) %>%
  mutate(Braak = as.character(Braak)) %>%
  # ggseg: dk=Desikan-Killiany atlas, fill by Braak region, label by ROI name
  ggseg(atlas="dk", mapping=aes(fill=Braak, label=region)) +
  scale_fill_manual(values=c("#F8766D", "#A3A617", "#00BF7D",
                             "#00B0F6", "#E76BF3"), na.value="gray80") +
  theme(axis.text.y=element_blank(),
        axis.text.x = element_text(family="calibri"),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"))
## Warning: Ignoring unknown aesthetics: label
# Convert to plotly interactive visualization
ggplotly(p.aparc, tooltip = c("fill", "label"), width=800, height=300)

# Plot the FreeSurfer subcortical segmentation atlas
p.aseg <- aseg %>% filter(hemi=="right") %>% 
  unnest(ggseg) %>% 
  select(region) %>% 
  na.omit() %>% 
  distinct() %>%
  left_join(., ggseg.aseg, by=c("region"="ggseg_ROI")) %>%
  mutate(Braak = as.character(Braak)) %>%
  # ggseg: aseg=subcortical segmentation atlas, fill by Braak region, label by ROI name
  ggseg(atlas="aseg", mapping=aes(fill=Braak, label=region)) +
  scale_fill_manual(values=c("deeppink", "#A3A617"), na.value="gray80") +
  theme(axis.text.y=element_blank(),
        axis.text.x = element_text(family="calibri"),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"))
## Warning: Ignoring unknown aesthetics: label
# Convert to plotly interactive visualization
ggplotly(p.aseg, tooltip=c("fill", "label"), width=800, height=300)

I will filter the tau-PET dataset to only include SUVR data for ROIs detailed in the above list, by first reshaping the tau-PET SUVR data from wide to long. Then, I will merge left and right hemisphere ROIs into one bilateral ROI by taking the mean SUVR.

# ADNI tau-PET data includes other ROIs outside of this Braak stage dataset; for this study, I'm not looking at those
tau.stand.roi <- tau.stand %>%
  pivot_longer(., cols=c(-RID, -VISCODE2, -EXAMDATE), names_to="ROI_Name", values_to="SUVR") %>%
  mutate(ROI_Name=tolower(ROI_Name)) %>%
  # Only keep ROIs included in Braak stage stratification via semi-join
  semi_join(., roi.braak) %>%
  # Once dataset has been filtered, add columns containing Braak stage and cortical lobe
  left_join(., roi.braak) %>%
  # Remove right/left distinction from ROI name
  mutate(ROI_Name = str_replace_all(ROI_Name, "right_|left_|ctx_rh_|ctx_lh_", "")) %>%
  # Group by bilateral ROI -- e.g. "hippocampus" which contains left and right hippocampus
  dplyr::group_by(RID, VISCODE2, EXAMDATE, ROI_Name, Braak) %>%
  # Calculate ROI average SUVR
  dplyr::summarise(SUVR = mean(SUVR, na.rm=T))
## Joining, by = "ROI_Name"
## Joining, by = "ROI_Name"
## `summarise()` regrouping output by 'RID', 'VISCODE2', 'EXAMDATE', 'ROI_Name' (override with `.groups` argument)

This yields the following 31 distinct bilateral ROIs:

data.frame(ROI=unique(tau.stand.roi$ROI_Name)) %>%
  left_join(., roi.braak, by=c("ROI"="Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  distinct() %>%
  datatable()

Final reshaping

Now, I will re-shape the tau-PET data back to wide to be compatible with the cognitive status data shape.

# Reshape long --> wide to be merged with CDR-SoB cognitive data
tau.stand.roi <- tau.stand.roi %>% 
  select(-Braak) %>%
  pivot_wider(id_cols=c(RID, VISCODE2, EXAMDATE), names_from="ROI_Name",
              values_from="SUVR")

str(tau.stand.roi)
## tibble [1,120 x 34] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                     : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE2                : chr [1:1120] "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                : Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##  $ amygdala                : num [1:1120] 1.875 3.407 11.684 9.215 0.718 ...
##  $ bankssts                : num [1:1120] 3.98 4.37 13.59 19.85 1.1 ...
##  $ caudalanteriorcingulate : num [1:1120] 4.539 3.86 12.853 10.023 0.627 ...
##  $ cuneus                  : num [1:1120] 5.15 5.068 15.032 12.777 0.606 ...
##  $ entorhinal              : num [1:1120] 3.579 2.209 15.515 10.671 0.719 ...
##  $ fusiform                : num [1:1120] 5.086 4.981 14.929 19.363 0.848 ...
##  $ hippocampus             : num [1:1120] 4.157 3.943 9.058 11.609 0.722 ...
##  $ inferiorparietal        : num [1:1120] 4.265 6.701 23.563 8.92 0.566 ...
##  $ inferiortemporal        : num [1:1120] 4.699 4.886 14.126 13.776 0.962 ...
##  $ insula                  : num [1:1120] 5.488 3.713 12.511 10.643 0.526 ...
##  $ isthmuscingulate        : num [1:1120] 5.086 4.271 12.47 14.818 0.525 ...
##  $ lateraloccipital        : num [1:1120] 7.416 6.346 23.661 17.155 0.853 ...
##  $ lingual                 : num [1:1120] 6.23 4.591 11.947 14.609 0.633 ...
##  $ middlefr                : num [1:1120] 6.311 3.499 16.865 7.813 0.486 ...
##  $ middletemporal          : num [1:1120] 5.895 4.684 14.276 8.208 0.697 ...
##  $ orbitofr                : num [1:1120] 6.545 4.629 16.458 12.738 0.698 ...
##  $ paracentral             : num [1:1120] 3.946 3.909 11.529 12.955 0.669 ...
##  $ parahippocampal         : num [1:1120] 5.06 2.76 16.62 12.15 0.61 ...
##  $ parsfr                  : num [1:1120] 6.1 4.1 15.54 14.23 0.62 ...
##  $ pericalcarine           : num [1:1120] 4.199 3.975 12.746 9.391 0.549 ...
##  $ postcentral             : num [1:1120] 5.544 3.378 14.07 14.106 0.648 ...
##  $ posteriorcingulate      : num [1:1120] 5.72 2.46 11.47 11.7 0.62 ...
##  $ precentral              : num [1:1120] 5.05 4.37 13.19 10.32 0.62 ...
##  $ precuneus               : num [1:1120] 5.285 1.738 9.159 10.021 0.936 ...
##  $ rostralanteriorcingulate: num [1:1120] 4.694 4.516 15.242 11.044 0.684 ...
##  $ superiorfrontal         : num [1:1120] 4.246 4.162 12.627 15.799 0.549 ...
##  $ superiorparietal        : num [1:1120] 8.289 5.042 20.393 13.962 0.486 ...
##  $ superiortemporal        : num [1:1120] 6.273 3.929 16.614 13.61 0.625 ...
##  $ supramarginal           : num [1:1120] 4.91 3.52 16.45 15.15 0.69 ...
##  $ temporalpole            : num [1:1120] 5.85 4.382 16.595 16.262 0.604 ...
##  $ transversetemporal      : num [1:1120] 3.456 2.565 13.554 9.961 0.458 ...
##  - attr(*, "groups")= tibble [1,120 x 4] (S3: tbl_df/tbl/data.frame)
##   ..$ RID     : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##   ..$ VISCODE2: chr [1:1120] "m144" "m144" "m156" "m144" ...
##   ..$ EXAMDATE: Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##   ..$ .rows   : list<int> [1:1120] [list output truncated]
##   ..- attr(*, ".drop")= logi TRUE

Merging datasets

subj.info <- subj.info %>% select(RID, VISCODE, AGE, PTGENDER, CDRSB)

I can’t actually join the two datasets on the EXAMDATE feature, as the recorded dates sometimes differ between files by one or two days depending on when the records were entered. Instead, I will join by the RID subject identifier and VISCODE, the visit code identifier.
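As a toy illustration with hypothetical rows, a join keyed on exam date silently drops visits whose recorded dates differ between files, while a join keyed on RID and visit code retains them:

```r
library(dplyr)

# Hypothetical PET and cognitive records for the same visit, entered one day apart
pet <- tibble::tibble(RID = 1, VISCODE = "m12", EXAMDATE = as.Date("2018-02-02"))
cog <- tibble::tibble(RID = 1, VISCODE = "m12", EXAMDATE = as.Date("2018-02-03"))

# Joining on EXAMDATE finds no match; joining on the visit code does
nrow(inner_join(pet, cog, by = c("RID", "EXAMDATE")))  # 0
nrow(inner_join(pet, cog, by = c("RID", "VISCODE")))   # 1
```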

# full.df = merged tau-PET data and cognitive assessment data
full.df <- inner_join(tau.stand.roi, subj.info, by=c("RID", "VISCODE2"="VISCODE"))  %>%
  filter(!is.na(CDRSB)) %>%
  group_by(RID) %>%
  dplyr::mutate(n_visits = n()) %>%
  # Only keep subjects with 2+ PET scans and cognitive assessments
  filter(n_visits>1) %>%
  select(-n_visits)

Here’s the structure of this merged dataset:

str(full.df)
## tibble [586 x 37] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                     : int [1:586] 31 31 56 56 56 69 69 69 96 96 ...
##  $ VISCODE2                : chr [1:586] "m144" "m156" "m144" "m156" ...
##  $ EXAMDATE                : Date[1:586], format: "2018-04-24" "2019-04-23" ...
##  $ amygdala                : num [1:586] 3.407 11.684 9.215 0.718 1.209 ...
##  $ bankssts                : num [1:586] 4.37 13.59 19.85 1.1 1.23 ...
##  $ caudalanteriorcingulate : num [1:586] 3.86 12.853 10.023 0.627 0.869 ...
##  $ cuneus                  : num [1:586] 5.068 15.032 12.777 0.606 1.018 ...
##  $ entorhinal              : num [1:586] 2.209 15.515 10.671 0.719 1.143 ...
##  $ fusiform                : num [1:586] 4.981 14.929 19.363 0.848 0.672 ...
##  $ hippocampus             : num [1:586] 3.943 9.058 11.609 0.722 0.761 ...
##  $ inferiorparietal        : num [1:586] 6.701 23.563 8.92 0.566 1.161 ...
##  $ inferiortemporal        : num [1:586] 4.886 14.126 13.776 0.962 1.253 ...
##  $ insula                  : num [1:586] 3.713 12.511 10.643 0.526 0.958 ...
##  $ isthmuscingulate        : num [1:586] 4.271 12.47 14.818 0.525 0.686 ...
##  $ lateraloccipital        : num [1:586] 6.346 23.661 17.155 0.853 0.853 ...
##  $ lingual                 : num [1:586] 4.591 11.947 14.609 0.633 0.874 ...
##  $ middlefr                : num [1:586] 3.499 16.865 7.813 0.486 1.122 ...
##  $ middletemporal          : num [1:586] 4.684 14.276 8.208 0.697 1.369 ...
##  $ orbitofr                : num [1:586] 4.629 16.458 12.738 0.698 0.994 ...
##  $ paracentral             : num [1:586] 3.909 11.529 12.955 0.669 1.005 ...
##  $ parahippocampal         : num [1:586] 2.76 16.62 12.15 0.61 1.05 ...
##  $ parsfr                  : num [1:586] 4.102 15.542 14.228 0.62 0.938 ...
##  $ pericalcarine           : num [1:586] 3.975 12.746 9.391 0.549 0.884 ...
##  $ postcentral             : num [1:586] 3.378 14.07 14.106 0.648 1.001 ...
##  $ posteriorcingulate      : num [1:586] 2.463 11.472 11.696 0.62 0.867 ...
##  $ precentral              : num [1:586] 4.371 13.185 10.322 0.62 0.652 ...
##  $ precuneus               : num [1:586] 1.738 9.159 10.021 0.936 1.223 ...
##  $ rostralanteriorcingulate: num [1:586] 4.516 15.242 11.044 0.684 0.791 ...
##  $ superiorfrontal         : num [1:586] 4.162 12.627 15.799 0.549 0.882 ...
##  $ superiorparietal        : num [1:586] 5.042 20.393 13.962 0.486 0.935 ...
##  $ superiortemporal        : num [1:586] 3.929 16.614 13.61 0.625 0.706 ...
##  $ supramarginal           : num [1:586] 3.52 16.45 15.15 0.69 1.15 ...
##  $ temporalpole            : num [1:586] 4.382 16.595 16.262 0.604 0.871 ...
##  $ transversetemporal      : num [1:586] 2.565 13.554 9.961 0.458 0.847 ...
##  $ AGE                     : num [1:586] 56.6 80.6 81.3 71.2 63.4 ...
##  $ PTGENDER                : chr [1:586] "Male" "Male" "Male" "Male" ...
##  $ CDRSB                   : num [1:586] 0.221 0.069 -1.222 3.439 4.209 ...
##  - attr(*, "groups")= tibble [247 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ RID  : int [1:247] 31 56 69 96 112 377 416 467 618 668 ...
##   ..$ .rows: list<int> [1:247] [list output truncated]
##   ..- attr(*, ".drop")= logi TRUE
cat("\nNumber of longitudinal tau-PET scans with accompanying cognitive data: **\n",
    nrow(full.df), "**\nNumber of subjects in merged dataset: **", 
    length(unique(full.df$RID)), "**\n", "\n", sep="")

Number of longitudinal tau-PET scans with accompanying cognitive data: 586
Number of subjects in merged dataset: 247

As it turns out, only 586 of the original 593 tau-PET scans had corresponding cognitive assessments. This leaves 586 unique PET scan datapoints for 247 subjects.

Rate of change calculation

Lastly, before I can perform outlier detection, I need to derive the longitudinal features upon which the prediction models will be built – namely, annual change in tau-PET SUVR and annual change in CDR-Sum of Boxes score.
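The annual change for a given metric is simply the difference between consecutive visits divided by the years elapsed. With hypothetical SUVR values for two visits two years apart:

```r
# Hypothetical SUVR values at two visits
suvr.v1 <- 1.20; suvr.v2 <- 1.35
date.v1 <- as.Date("2018-01-10"); date.v2 <- as.Date("2020-01-10")

# Years elapsed between visits (365-day year, as in the pipeline)
n.years <- as.numeric(date.v2 - date.v1) / 365

# Change in SUVR per year
annual.change <- (suvr.v2 - suvr.v1) / n.years
annual.change  # 0.075
```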

# Calculate change in tau-PET SUVR and CDR-SoB over time, normalized to # of years elapsed
annual.changes <- full.df %>%
  ungroup() %>%
  select(-VISCODE2) %>%
  # Reshape wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE, -PTGENDER, -AGE), names_to="Metric",
               values_to="Value") %>%
  # Calculate number of years between each visit as well as change in SUVR or CDR-SoB
  dplyr::group_by(RID, PTGENDER, Metric) %>%
  dplyr::summarise(n_years = as.numeric((EXAMDATE - lag(EXAMDATE, 
                                                        default=EXAMDATE[1]))/365),
                   change = Value - lag(Value, default=Value[1]),
                   Age_Baseline = AGE[n_years==0]) %>%
  # Remove data points corresponding to first visit
  filter(n_years > 0) %>%
  # Calculate change in either tau-PET SUVR or CDR-SoB change per year
  dplyr::mutate(Annual_Change = change/n_years) %>%
  # Remove columns that are no longer needed
  select(-n_years, -change) %>%
  group_by(RID, Metric) %>%
  # Assign row identifier with row_number()
  dplyr::mutate(interval_num = row_number()) %>%
  # Reshape from long --> wide
  pivot_wider(., id_cols=c(RID, Age_Baseline, PTGENDER, interval_num), names_from=Metric,
              values_from=Annual_Change) %>%
  rename("Sex" = "PTGENDER")
## `summarise()` regrouping output by 'RID', 'PTGENDER', 'Metric' (override with `.groups` argument)
datatable(annual.changes[1:5])

Outlier detection

Now that the datasets are merged, I can perform outlier detection. Given the multivariate nature of this dataset (i.e. multiple brain regions), I will use Cook’s Distance to estimate the relative influence of each data point in a simple multiple regression model.
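For context, Cook's distance for observation i combines its standardized residual r_i and leverage h_i as D_i = (r_i^2 / p) * h_i / (1 - h_i), where p is the number of model coefficients. A quick check of this identity on R's built-in mtcars data (used here purely for illustration):

```r
# Fit a small regression on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Built-in Cook's distance vs. the closed-form definition
d.builtin <- cooks.distance(fit)
h <- hatvalues(fit)    # leverages
r <- rstandard(fit)    # internally studentized residuals
p <- length(coef(fit)) # number of coefficients (including intercept)
d.manual <- (r^2 / p) * h / (1 - h)

all.equal(unname(d.builtin), unname(d.manual))  # TRUE
```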

# Calculate Cook's Distance using multiple regression of CDR-SoB annual change on tau-PET SUVR change for all 31 ROIs
# (named cooks.dist so as not to mask stats::cooks.distance)
cooks.dist <- cooks.distance(lm(CDRSB ~ . - RID - interval_num, data=annual.changes))

# Plot Cook's distance for each subject
p.cooks <- data.frame(CD=cooks.dist) %>%
  rownames_to_column(var="Data_Point") %>%
  mutate(Data_Point=as.numeric(Data_Point)) %>%
  mutate(Label=ifelse(CD>30, Data_Point, NA_real_)) %>%
  ggplot(data=., mapping=aes(x=Data_Point,y=CD)) +
  geom_hline(yintercept = 4*mean(cooks.dist,na.rm=T), color="blue") +
  geom_point() +
  geom_text(aes(label=Label), nudge_y=1.5) +
  ylab("Cook's Distance") +
  xlab("annual.changes index") +
  theme_minimal()

# Convert to interactive plotly visualization
ggplotly(p.cooks)
rm(p.cooks)

All but one data point have relatively low Cook’s distance values, while data point #224 has a relatively large Cook’s distance. This suggests large residuals and leverage associated with this datapoint, which could distort model fitting and accuracy. Upon further examination of this instance:

# Show data from row 224
as.data.frame(t(annual.changes[224,])) %>%
  rownames_to_column(var="Variable") %>%
  dplyr::rename("Value" = "V1") %>%
  datatable()
## Warning: The `i` argument of ``[.tbl_df`()` must lie in [0, rows] if positive, as of tibble 3.0.0.
## Use `NA_integer_` as row index to obtain a row full of `NA` values.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

This subject exhibits very large fluctuations in tau-PET SUVR values in several brain regions over this time interval. Given that SUVR values typically range from 0.75 to 2, changes of this magnitude are surprising and likely render this data point an outlier. Fortunately, the interval_num of 2 indicates that this is the second time interval for this subject, so omitting this interval doesn’t reduce the total number of subjects in the analysis. I will remove this data point:

# Omit row 224
annual.changes <- annual.changes[-224,]
## Warning: The `i` argument of ``[.tbl_df`()` must lie in [-rows, 0] if negative, as of tibble 3.0.0.
## Use `NA_integer_` as row index to obtain a row full of `NA` values.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Tau-PET annual change across ROIs

I can now finish some aspects of data exploration that depended upon refining the subject cohort as well as the features. For starters, I will examine the distribution of annual tau change in each of the 31 ROIs:

# reshape tau-PET SUVR change ROI-wise data from wide --> long
annual.changes.tau <- annual.changes %>%
  select(-CDRSB) %>%
  ungroup() %>%
  pivot_longer(cols=c(-RID, -interval_num, -Age_Baseline, -Sex), names_to="ROI",
               values_to="deltaSUVR")

# Plot SUVR change distribution for each ROI using multi.hist from psych package
multi.hist(annual.changes %>% ungroup() %>% select(-RID, -interval_num, -CDRSB, -Age_Baseline, -Sex),
           dcol="red")

The distribution looks reasonably normal for each ROI, and all of the curves peak around zero, suggesting each ROI has a mean of approximately 0. Since these data contain negative values, neither a log nor a square-root transformation would be applicable anyway. Therefore, I will leave the variable distributions as-is.

Next, I will visualize the correlation in annual tau change between each of the ROIs measured:

# Select only tau-PET ROIs from annual change dataset
annual.roi <- annual.changes %>% ungroup() %>% select(-RID, -interval_num, -CDRSB, -Age_Baseline, -Sex)
# Calculate correlation matrix between all ROIs
roi.cor <- cor(annual.roi)

# Plot the correlation matrix using ggcorrplot, order by hierarchical clustering
ggcorrplot(roi.cor, hc.order = TRUE, outline.col = "white") %>% 
  # Convert to interactive correlogram with plotly
  ggplotly(width=800, height=700)

As it turns out, all ROIs show positive correlations in the annual rate of change in tau-PET uptake, with the exception of three ROI pairs:

  • entorhinal and pericalcarine (R=-0.14)
  • entorhinal and transverse temporal (R=-0.07)
  • transverse temporal and lateral occipital (R=-0.07)

These are very weak correlations, and can be further visualized with scatter plots:

# Highlight the ROIs noted above with (slightly) negative correlations
# Use ggpairs function from GGally to visualize their pairwise distributions
p.select.rois <- ggpairs(annual.roi %>% select(entorhinal, pericalcarine, 
                                               transversetemporal,
                                               lateraloccipital),
                         # Add regression line to scatter plots
                         lower = list(continuous = wrap("smooth", se=F, 
                                                        method = "lm", 
                                                        color="lightslategray",
                                                        alpha=0.4))) +
  theme_minimal()

# Convert to interactive pair plot with plotly
ggplotly(p.select.rois, width=700, height=600)
## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?
## Warning: Can only have one: highlight

Based on the scatter plots, these negative correlations are indeed weak and mostly noise. Regarding the positive correlations, I am curious whether there are underlying trends based on spatial proximity and/or tau progression in AD, as captured by cortical lobe and Braak region, respectively:

# Convert correlation matrix from wide --> long, each row will detail one pairwise correlation
roi.cor.long <-  as.data.frame(roi.cor) %>%
  rownames_to_column(var="ROI1") %>%
  pivot_longer(cols=c(-ROI1), names_to="ROI2", values_to="Pearson_Corr") %>%
  # Exclude rows where the two ROIs are the same, where correlation is always 1
  filter(ROI1 != ROI2) %>%
  # Join with Braak-stage stratification dataframe by the first ROI column
  left_join(., roi.braak, by=c("ROI1"="Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  # Specify that Braak + Cortex info pertain to the first ROI column, not the second
  dplyr::rename("ROI1_Braak" = "Braak", "ROI1_Cortex" = "Cortex") %>%
  # Merge again with Braak-stage stratification, this time by the second ROI column
  left_join(., roi.braak, by=c("ROI2" = "Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  dplyr::rename("ROI2_Braak" = "Braak", "ROI2_Cortex" = "Cortex") %>%
  # Rename Insula as Ins for visualization purpose
  mutate_at(c("ROI1_Cortex", "ROI2_Cortex"), function(x) ifelse(x=="Insula", "Ins", x))

# Plot the correlation by Cortical Lobe
p.cor.cortical <- roi.cor.long %>%
  ggplot(., mapping=aes(x=ROI1, y=ROI2)) +
  # Heatmap, fill squares by pearson correlation coefficient
  geom_tile(mapping=aes(fill=Pearson_Corr)) +
  labs(fill="Pearson Coefficient") +
  theme_minimal() +
  # Facet by cortical lobes
  facet_grid(ROI2_Cortex ~ ROI1_Cortex, scales="free", 
             space="free", switch="both") +
  ggtitle("Correlation in Annual Tau SUVR Change by Cortical Lobe") +
  theme(axis.title=element_blank(),
        axis.text.y = element_text(size=11),
        axis.text.x = element_text(angle=90, size=11, hjust=1),
        panel.border = element_blank(), 
        panel.grid = element_blank(),
        plot.title=element_text(hjust=0.5)) +
  theme(strip.placement = "outside") +
  scale_fill_gradient2(low="#210DFC", mid="white", high="#FF171B",
                       limits=c(-1,1))

# Save correlation matrix to PNG and print -- dimensions were bizarre if I plotted directly from ggplot
ggsave("ROI_Correlation_Cortical.png", plot=p.cor.cortical, width=9, height=7.5, units="in", dpi=300)
include_graphics("ROI_Correlation_Cortical.png")

There are somewhat stronger inter-correlations within the frontal and parietal cortices compared with other cortical lobes. Now stratifying based on ROI Braak stage:

# Plot the correlation by Braak Stage
p.cor.braak <- roi.cor.long %>%
  ggplot(., mapping=aes(x=ROI1, y=ROI2)) +
  # Heatmap, fill squares by pearson correlation coefficient
  geom_tile(mapping=aes(fill=Pearson_Corr)) +
  labs(fill="Pearson Coefficient") +
  theme_minimal() +
  # Facet by Braak stage
  facet_grid(ROI2_Braak ~ ROI1_Braak, scales="free", 
             space="free", switch="both") +
  ggtitle("Correlation in Annual Tau SUVR Change by Braak Stage") +
  theme(axis.title=element_blank(),
        axis.text.y = element_text(size=11),
        axis.text.x = element_text(angle=90, size=11, hjust=1),
        panel.border = element_blank(), 
        panel.grid = element_blank()) +
  theme(strip.placement = "outside") +
  scale_fill_gradient2(low="#210DFC", mid="white", high="#FF171B",
                       limits=c(-1,1))

# Save correlation matrix to PNG and print -- dimensions were bizarre if I plotted directly from ggplot
ggsave("ROI_Correlation_Braak.png", plot=p.cor.braak, width=9, height=7.5, units="in", dpi=300)
include_graphics("ROI_Correlation_Braak.png")

 

This generally high correlation in annual tau-PET SUVR changes between cortical regions may pose a challenge when it comes to modeling, due to feature collinearity.

Principal component analysis (PCA)

While I want to keep each region distinct for biological context, I will also reduce the dimensionality of the data using principal component analysis (PCA), which has the added benefit of yielding orthogonal (uncorrelated) components to serve as input for the modeling phase. Since all the variables are in the same units (change in SUVR per year), I only need to center the data, not scale it.
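As a sanity check on synthetic data, prcomp() with centering only is equivalent to a singular value decomposition of the column-centered matrix; the component standard deviations are the singular values scaled by 1/sqrt(n - 1):

```r
set.seed(1)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)

# PCA with centering but no scaling
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# SVD of the column-centered matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)

# sdev_k = d_k / sqrt(n - 1)
all.equal(pca$sdev, sv$d / sqrt(nrow(X) - 1))  # TRUE
```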

# Convert dataframe to numerical matrix
pca.df <- as.matrix(annual.changes %>% ungroup() %>% select(-RID, -CDRSB, -interval_num, -Age_Baseline, -Sex))

# Perform PCA with prcomp() from stats package
# Center data but don't scale, as all columns are in same units (change in SUVR per year)
res.pca <- prcomp(pca.df, center=T, scale.=F)

# The variable info can be extracted as follows:
var <- get_pca_var(res.pca)

The proportion of variance explained by each principal component (PC) can be visualized using a Scree plot:

# Calculate cumulative proportion of variance explained
cumpro <- cumsum(res.pca$sdev^2 / sum(res.pca$sdev^2)*100)
# Calculate individual proportion of variance explained by each principal component
variances <- data.frame((res.pca$sdev^2/sum(res.pca$sdev^2))*100)
# Label PCs
variances$PC <- c(1:31)
# Add in cumulative proportion of variance to dataframe
variances$cumpro <- cumpro
# Establish column names
colnames(variances) <- c("Variance_Proportion", "PC", "CumVar")

# Create a new row just for visualization purpose, helps with axis structure
newrow <- subset(variances, PC == 31)
newrow$PC <- 31.5
variances <- plyr::rbind.fill(variances, newrow)

# Set individual variance line as maroon, cumulative variance line as green
linecolors <- c("Component Variance" = "maroon4",
                "Cumulative Variance" = "green4")

# Plot the customized Scree plot
p.var <- variances %>%
  ggplot(data=.) +
  # Add a bar for each principal component, showing individual proportion of variance
  geom_bar(data=subset(variances, PC < 32), mapping=aes(x=PC, y=Variance_Proportion),
           stat="identity", fill="steelblue") +
  # Add line for individual component variance
  geom_line(aes(color="Component Variance", x=PC, y=Variance_Proportion), 
            size=0.7, data=subset(variances, PC < 31.1), show.legend=F) +
  # Add line for cumulative variance explained with each additional principal component
  geom_line(aes(x=PC, y=CumVar, color="Cumulative Variance"), size=0.7,
            data=subset(variances, PC < 32)) +
  geom_point(aes(x=PC, y=Variance_Proportion),data=subset(variances, PC < 31.1), size=1.5) +
  # Set line colors as defined above
  scale_colour_manual(name="",values=linecolors, 
                      guide = guide_legend(override.aes=list(size=2))) + 
  theme_minimal() +
  ylab("Percentage of Variance Explained") +
  xlab("Principal Component") +
  ggtitle("Principal Components\nContribution to Subject Variance") +
  # Manually define x-axis and y-axis limits so line aligns with bar plot
  xlim(c(0.5, 31.5)) +
  scale_y_continuous(breaks=seq(0, 100, 10),
                     sec.axis = dup_axis(name="")) +
  theme(plot.title = element_text(hjust=0.5, size=14),
        axis.text = element_text(size=12),
        panel.grid = element_blank(),
        legend.position="bottom",
        legend.text = element_text(size=12))

# Convert to plotly interactive visualization
ggplotly(p.var, tooltip=c("x","y")) %>% 
  layout(legend = list(orientation = "h", y=-0.2))

The first five principal components (PCs) collectively explain 77.2% of variance in the data; beyond these components, there are only marginal increases in the cumulative variance explained. Therefore, I will move forward with these first five PCs.

Individual ROI contributions (loadings) per component can be extracted:

# variable loadings are stored in the $rotation vector of prcomp output
loadings_wide <- data.frame(res.pca$rotation) %>%
  # add rownames as column
  cbind(ROI=rownames(.), .) %>%
  # remove original row names
  remove_rownames() %>% 
  # Select ROI and first five PCs
  select(ROI:PC5) %>%
  rowwise() %>%
  # Join with Braak stage stratification dataframe
  left_join(., roi.braak, by=c("ROI"="Base_Name")) %>%
  select(ROI, Cortex, Braak, PC1:PC5) %>%
  distinct()

# Print interactive datatable
datatable(loadings_wide %>% mutate_if(is.numeric, function(x) round(x,4)))

I’m curious as to whether ROIs exhibit similar covariance in annual tau-PET changes based on spatial proximity (i.e. cortical region) and/or similar Alzheimer’s Disease progression (i.e. Braak stage).

# Plot loadings colored by cortical lobe
p.cortex <- loadings_wide %>%
  ggplot(data=., mapping=aes(x=PC1, y=PC2, label=ROI)) +
  geom_hline(yintercept=0, linetype=2, alpha=0.5) +
  geom_vline(xintercept=0, linetype=2, alpha=0.5) +
  geom_point(aes(color=Cortex), size=3) +
  theme_minimal() +
  xlab("PC1 (41.6% Variance)") +
  ylab("PC2 (14.0% Variance)") +
  ggtitle("ROI PC Loadings by Cortical Region") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive scatterplot with plotly
ggplotly(p.cortex, tooltip=c("label", "x", "y"))
rm(p.cortex)

First, all of the ROIs exhibit a negative loading (correlation) with PC1. Beyond that, all of the occipital and parietal cortex ROIs are positively correlated with PC2, while the insula, temporal cortex, and cingulate cortex ROIs are all negatively correlated with PC2. The frontal cortex ROIs sit near the PC2 boundary, with low loadings in both directions.

# Plot PC loadings colored by Braak stage
p.braak <- loadings_wide %>%
  ggplot(data=., mapping=aes(x=PC1, y=PC2, label=ROI)) +
  geom_hline(yintercept=0, linetype=2, alpha=0.5) +
  geom_vline(xintercept=0, linetype=2, alpha=0.5) +
  geom_point(aes(color=Braak), size=3) +
  theme_minimal() +
  xlab("PC1 (41.6% Variance)") +
  ylab("PC2 (14.0% Variance)") +
  ggtitle("ROI PC Loadings by Braak Stage") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive scatterplot with plotly
ggplotly(p.braak, tooltip=c("label", "x", "y"))
rm(p.braak)

There is not as clear a distinction to be made based on ROI Braak stage. One observation that does stand out is that all of the Braak VI ROIs cluster together in the upper right of the plot. Beyond that, the Braak stages are intermixed in this loading plot.

Moving on, the subject and time interval info can be linked with the PCA results:

# post.pca = subject identification info + first five PCs
post.pca <- as.data.frame(res.pca$x[,1:5]) %>%
  cbind(., RID=annual.changes$RID) %>%
  cbind(., interval_num=annual.changes$interval_num) %>%
  cbind(., CDRSB=annual.changes$CDRSB) %>%
  cbind(., Age_Baseline=annual.changes$Age_Baseline) %>%
  cbind(., Sex=annual.changes$Sex) %>%
  select(RID, interval_num, Age_Baseline, Sex, CDRSB, PC1:PC5)

# print interactive datatable
datatable(post.pca %>% mutate_if(is.numeric, function(x) round(x,5)) %>% select(-Age_Baseline, -Sex))

Dummy variable encoding

The final step is to convert sex into a dummy variable for modeling:

# Encode sex as a binary variable for Sex_Male
annual.changes$Sex_Male <- ifelse(annual.changes$Sex=="Male", 1, 0)
post.pca$Sex_Male <- ifelse(post.pca$Sex=="Male", 1, 0)

# Remove original Sex feature
annual.changes <- annual.changes %>% select(-Sex)
post.pca <- post.pca %>% select(-Sex)

Phase Four: Modeling (Original Data)

Data split for train + test

As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, so that the models are trained and evaluated on independent data. I will use the sample function with a fixed seed (127) to partition the data into 75% training and 25% testing.

# Set seed for consistency in random sampling for the train/test split
set.seed(127)
train.index <- sample(nrow(annual.changes), nrow(annual.changes)*0.75, replace=F)

# Remove unnecessary identifier info from datasets for modeling
original <- annual.changes %>% ungroup() %>% select(-RID, -interval_num)

# Pre-processing will be applied in model training with caret

# Subset training + test data for original (ROI) data
original.train <- original[train.index, ]
original.test <- original[-train.index, ]

Individual model training and tuning

I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.

The first step is to create a caretList of the four regression models I will use:

  • Elastic net regression (glmnet)
  • k-nearest neighbors regression (knn)
  • Neural network regression (nnet)
  • Random forest regression (ranger)

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.

# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)

I’m using RMSE as the metric for tuning each model’s parameters. All input features will be standardized automatically via caret’s preProcess center + scale option; the centering and scaling parameters are estimated on the training data and re-applied to any data passed at prediction time.
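As a point of reference, center + scale preprocessing is just per-feature z-scoring; here is a minimal base-R sketch of the computation on a toy vector (illustrative numbers, not the actual ROI data):

```r
# Toy feature vector (illustrative only)
x <- c(2, 4, 6, 8, 10)

# center + scale: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# Identical to base R's scale()
all.equal(as.numeric(scale(x)), z)  # TRUE
```

After this transformation each feature has mean 0 and standard deviation 1, which puts the ROI uptake values, age, and sex dummy on a comparable scale for distance- and penalty-based models like kNN and glmnet.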

# Set seed for consistency
set.seed(127)

# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, 
                               tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), 
                                                      decay = seq(0, 1, by=0.2)))

# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)

# random forest: try 15 different numbers of features considered at each node and use 500 sampled trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")

# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",
                                tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))

# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ ., 
                                                      data=original.train, 
                                                      trControl=ensemble.control, 
                                                      metric="RMSE", 
                                                      preProcess=c("center", "scale"),
                                                      tuneList=list(my.neuralnet, 
                                                                    my.knn, 
                                                                    my.randomforest, 
                                                                    my.elasticnet))))
## Warning in trControlCheck(x = trControl, y = target): trControl$savePredictions
## not 'all' or 'final'. Setting to 'final' so we can ensemble the models.
## Warning in trControlCheck(x = trControl, y = target): indexes not defined in
## trControl. Attempting to set them ourselves, so each model in the ensemble will
## have the same resampling indexes.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

The final chosen parameters for each model can be viewed:

Elastic net final model

# Print elastic net model summary
ensemble.models$glmnet

# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
  mutate(alpha=as.character(alpha)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=alpha)) +
  theme_minimal() +
  ggtitle("Elastic Net Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "lambda", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "lambda", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet 
## 
## 146 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE      Rsquared    MAE     
##   0.0    0.00032  3.445119  0.09542003  2.808984
##   0.0    0.00800  3.445119  0.09542003  2.808984
##   0.0    0.20000  3.357107  0.08443950  2.728564
##   0.0    1.00000  3.347762  0.09252242  2.722655
##   0.1    0.00032  4.333794  0.08133285  3.441556
##   0.1    0.00800  3.939633  0.08302928  3.148241
##   0.1    0.20000  3.360567  0.10366488  2.734317
##   0.1    1.00000  3.344065  0.09973986  2.708923
##   0.6    0.00032  4.402620  0.08683575  3.485667
##   0.6    0.00800  3.852061  0.08090841  3.110079
##   0.6    0.20000  3.352485  0.10379507  2.720541
##   0.6    1.00000  3.319450  0.11334288  2.688184
##   1.0    0.00032  4.420295  0.08681546  3.500900
##   1.0    0.00800  3.845411  0.06883003  3.112601
##   1.0    0.20000  3.329798  0.12073187  2.697525
##   1.0    1.00000  3.341064         NaN  2.706431
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.6 and lambda = 1.

 

kNN final model

# Print kNN model summary
ensemble.models$knn

# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per metric
  ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("kNN Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.knn.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "k", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "k", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors 
## 
## 146 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared    MAE     
##    5  3.651126  0.07360204  2.953790
##    7  3.561344  0.02384452  2.897339
##    9  3.477898  0.03133779  2.783562
##   11  3.455537  0.04551948  2.779588
##   13  3.415426  0.07647415  2.748112
##   15  3.410075  0.06093929  2.752946
##   17  3.418949  0.05988781  2.751391
##   19  3.375696  0.06947185  2.726048
##   21  3.359218  0.09112963  2.719784
##   23  3.359253  0.07502934  2.732045
##   25  3.361499  0.09230724  2.736707
##   27  3.345560  0.10109715  2.723787
##   29  3.333740  0.09090090  2.708902
##   31  3.336070  0.11594567  2.706367
##   33  3.328221  0.09603620  2.703790
##   35  3.319700  0.08958625  2.700915
##   37  3.313724  0.09034499  2.689110
##   39  3.309776  0.08587716  2.688118
##   41  3.307853  0.09405924  2.687463
##   43  3.313425  0.09786186  2.689286
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 41.

 

Neural network final model

# Print neural network model summary
ensemble.models$nnet

# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
  mutate(decay=as.character(decay)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of decay
  ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=decay)) +
  theme_minimal() +
  ggtitle("Neural Network Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.nnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Neurons in Hidden Layer", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Neurons in Hidden Layer", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network 
## 
## 146 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared    MAE     
##    1    0.0    3.634117  0.09930884  2.976800
##    1    0.2    3.576740  0.09283196  2.886103
##    1    0.4    3.745677  0.08588792  3.013432
##    1    0.6    3.721251  0.06975126  3.033182
##    1    0.8    3.641049  0.06573590  2.971811
##    1    1.0    3.575166  0.06801171  2.898438
##    3    0.0    4.438391  0.05822184  3.553142
##    3    0.2    4.645276  0.05324361  3.580507
##    3    0.4    4.262647  0.05537087  3.423523
##    3    0.6    4.012297  0.04263429  3.227800
##    3    0.8    3.933914  0.05476791  3.167765
##    3    1.0    3.947122  0.10163622  3.131860
##    5    0.0    4.868588  0.06839652  3.976531
##    5    0.2    4.945099  0.06154137  3.789715
##    5    0.4    4.754286  0.08275376  3.702982
##    5    0.6    4.238883  0.03719622  3.358227
##    5    0.8    4.003123  0.11890069  3.122017
##    5    1.0    4.045693  0.08848388  3.222295
##   10    0.0    6.157697  0.12364201  4.805280
##   10    0.2    4.900184  0.11266349  3.820657
##   10    0.4    4.385394  0.09933619  3.558815
##   10    0.6    4.171166  0.14411854  3.428269
##   10    0.8    3.995834  0.10827539  3.213105
##   10    1.0    4.055811  0.07766275  3.276869
##   15    0.0    6.630830  0.05058592  4.790808
##   15    0.2    4.755751  0.06146427  3.815271
##   15    0.4    4.339336  0.10192773  3.475802
##   15    0.6    4.128593  0.11101032  3.318602
##   15    0.8    4.090662  0.10161286  3.243388
##   15    1.0    3.962751  0.07251835  3.133097
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 1.

The optimal neural network includes a single neuron in the hidden layer, which receives input from all 33 input nodes (31 ROIs, baseline age, and sex) and outputs onto the final prediction node. Here’s a graphical representation of this caret-trained neural network:

par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
        circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)

Random forest final model

# Print random forest model summary
ensemble.models$ranger

# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per split rule
  ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=splitrule)) +
  theme_minimal() +
  ggtitle("Random Forest Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.rf.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Predictors in Decision Node", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Predictors in Decision Node", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest 
## 
## 146 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared    MAE     
##    2    variance    3.421755  0.05508509  2.766100
##    2    extratrees  3.402943  0.05391511  2.743693
##    4    variance    3.415647  0.05819127  2.769543
##    4    extratrees  3.417787  0.05810918  2.747933
##    6    variance    3.419529  0.05780733  2.777430
##    6    extratrees  3.425549  0.06131056  2.768911
##    8    variance    3.428424  0.07205966  2.784551
##    8    extratrees  3.425459  0.06136747  2.769902
##   10    variance    3.462483  0.05527427  2.802826
##   10    extratrees  3.450699  0.05839774  2.793909
##   13    variance    3.441806  0.06193697  2.782877
##   13    extratrees  3.421396  0.06696843  2.761534
##   15    variance    3.429225  0.06680171  2.779725
##   15    extratrees  3.432775  0.06200092  2.786326
##   17    variance    3.439894  0.06155479  2.800026
##   17    extratrees  3.452706  0.06102057  2.791457
##   19    variance    3.444414  0.06864299  2.793224
##   19    extratrees  3.445687  0.06641398  2.791125
##   21    variance    3.459688  0.06702456  2.793978
##   21    extratrees  3.435185  0.06425679  2.790629
##   24    variance    3.449497  0.05778792  2.782040
##   24    extratrees  3.449979  0.07035871  2.791869
##   26    variance    3.446068  0.06262467  2.783203
##   26    extratrees  3.447919  0.06605339  2.802779
##   28    variance    3.454012  0.06620066  2.797928
##   28    extratrees  3.421732  0.07025218  2.763648
##   30    variance    3.431357  0.07528820  2.774607
##   30    extratrees  3.431973  0.06422735  2.789817
##   33    variance    3.453928  0.06935706  2.789172
##   33    extratrees  3.442609  0.06995728  2.785139
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 2, splitrule = extratrees
##  and min.node.size = 5.

Individual model performance

The cross-validated performance of these four models can be collected and compared using the resamples function:

# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
## 
## Call:
## summary.resamples(object = ensemble.results)
## 
## Models: nnet, knn, ranger, glmnet 
## Number of resamples: 10 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   2.520617 2.655365 2.864441 2.898438 3.134849 3.403922    0
## knn    2.062183 2.537464 2.726949 2.687463 2.838647 3.054930    0
## ranger 2.231983 2.536102 2.719899 2.743693 2.872593 3.480428    0
## glmnet 2.055679 2.674028 2.754094 2.688184 2.849022 2.912111    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   3.073485 3.407534 3.446786 3.575166 3.819688 4.245425    0
## knn    2.767535 3.088663 3.394956 3.307853 3.518702 3.746856    0
## ranger 2.937279 3.126863 3.385416 3.402943 3.667314 3.917424    0
## glmnet 2.641695 3.163603 3.360819 3.319450 3.610857 3.638297    0
## 
## Rsquared 
##                Min.     1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## nnet   0.0064443211 0.032130174 0.05294404 0.06801171 0.08559669 0.2062706    0
## knn    0.0001298472 0.016446484 0.06269473 0.09405924 0.17185832 0.2314180    0
## ranger 0.0005339141 0.011507785 0.02369916 0.05391511 0.08758365 0.1621449    0
## glmnet 0.0048759280 0.008830315 0.03873983 0.11334288 0.15747674 0.4818376    0

Looking at the mean RMSE across resampling iterations, kNN has the lowest RMSE, with the elastic net (glmnet) close behind; by contrast, the neural network has the highest RMSE. The summary reports the \(R^2\) values as well, though since most of these models are non-linear regressions (aside from the elastic net), \(R^2\) is not an ideal comparison metric in this instance.
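As a reminder of what these two error metrics measure, both can be computed directly from the residuals; a self-contained sketch with illustrative numbers (not model output):

```r
# Toy actual and predicted values (illustrative only)
actual    <- c(0.5, 1.0, 2.0, 3.5)
predicted <- c(0.7, 0.8, 2.5, 3.0)

# RMSE squares the residuals, so it penalizes large errors more heavily
rmse <- sqrt(mean((actual - predicted)^2))

# MAE weights all residuals equally
mae <- mean(abs(actual - predicted))

round(c(RMSE = rmse, MAE = mae), 4)  # RMSE 0.3808, MAE 0.3500
```

Because RMSE up-weights large residuals, tuning on RMSE favors models that avoid occasional big misses in the predicted CDR-SoB change, even at the cost of slightly larger typical errors.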

It’s also useful to look at the regression correlation between these component models:

cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
## 
## Root-Mean Square Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.7663382 0.8052282 0.5832034
## knn    0.7663382 1.0000000 0.9290571 0.9092885
## ranger 0.8052282 0.9290571 1.0000000 0.7361422
## glmnet 0.5832034 0.9092885 0.7361422 1.0000000
## 
## 
## Mean Absolute Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.7045263 0.8292703 0.4580054
## knn    0.7045263 1.0000000 0.9080002 0.9045475
## ranger 0.8292703 0.9080002 1.0000000 0.6739868
## glmnet 0.4580054 0.9045475 0.6739868 1.0000000
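What modelCor reports is the pairwise Pearson correlation of fold-level errors across the shared resampling indexes; a self-contained toy sketch of that computation (the numbers are illustrative, not taken from the resamples object above):

```r
# Toy per-fold RMSE values for two models across five shared CV folds
rmse.modelA <- c(3.1, 3.4, 3.2, 3.6, 3.3)
rmse.modelB <- c(3.0, 3.5, 3.1, 3.7, 3.2)

# A high correlation means both models err on the same folds,
# which limits how much an ensemble can cancel their errors
cor(rmse.modelA, rmse.modelB)
```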

kNN and random forest show high error correlation (r ≈ 0.93), as do kNN and elastic net (r ≈ 0.91). The remaining model pairs also show moderately high error correlation; this is not ideal, since ensembles are most effective when the component models make uncorrelated errors that can counterbalance one another. The error can be visualized with confidence intervals using the dotplot function:

# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE")) 
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)

These four models show comparable RMSE confidence intervals. kNN shows the lowest estimated RMSE and MAE, while the neural network shows the highest of both. Despite the high error correlation between the models, I’ll move forward to see if ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.

Model ensemble training and tuning

I will compare three different types of stacked ensembles:

  1. Default caretEnsemble, which stacks the component models via a generalized linear model
  2. Stacked ensemble via caretStack, using a random forest meta-model to combine the component models
  3. Stacked ensemble via caretStack, using an elastic net meta-model to combine the component models
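Conceptually, the GLM stack is just a regression of the outcome on the component models' cross-validated predictions, with the fitted coefficients acting as the ensemble weights; a minimal self-contained sketch with simulated base-model predictions (not the caretEnsemble internals):

```r
set.seed(1)
# Simulated outcome and two noisy "base model" predictions of it
y     <- rnorm(100)
pred1 <- y + rnorm(100, sd = 0.5)
pred2 <- y + rnorm(100, sd = 0.8)

# The meta-learner regresses y on the base predictions;
# its coefficients play the role of the ensemble weights
meta <- lm(y ~ pred1 + pred2)
coef(meta)
```

The random forest and elastic net stacks below replace this linear meta-learner with a non-linear or penalized one, but the input to the meta-model is the same: the component models' predictions.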

Generalized linear ensemble

Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:

set.seed(127)
# Set trainControl --> 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)

# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet 
## They were weighted: 
## -0.5223 -0.2044 1.2915 -0.1983 1.2208
## The resulting RMSE is: 3.4523
## The fit for each individual model on the RMSE is: 
##  method     RMSE    RMSESD
##    nnet 3.575166 0.3667437
##     knn 3.307853 0.3024359
##  ranger 3.402943 0.3475629
##  glmnet 3.319450 0.3300138

The kNN model shows the lowest cross-validated RMSE of the four component models, with the elastic net close behind. The caretEnsemble package provides an autoplot method to create in-depth diagnostic plots for ensemble models:

autoplot(stacked.ensemble.glm)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.

## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
## Warning: Duplicated aesthetics after name standardisation: size
## Warning in predict.caretList(object$models, ...): Predicting without new data is
## not well supported. Attempting to predict on the training data.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.005
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.005
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.01

The top left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The kNN model shows the lowest RMSE of the four, and its RMSE is even lower than that of the full ensemble model (indicated by the red dashed line). The middle left plot shows the relative weight given to each model in the linear combination; kNN receives the highest weight, followed closely by the elastic net, while the neural network and random forest both receive small negative weights.

Stacked ensembles

Stacked ensembles can also be constructed using caretStack, which fits a user-specified meta-model on the constituent models’ predictions. I’ll build one with a random forest meta-model and one with an elastic net meta-model.

set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)

The summary statistics for each model can be displayed:

cat("\nStacked ensemble, generalized linear model:\n") 
stacked.ensemble.glm 
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
## 
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 130, 132, 132, 130, 133, 131, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   3.452304  0.07549175  2.819455
## 
## 
## 
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Random Forest 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 130, 132, 132, 130, 133, 131, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##   2     3.621656  0.05389468  2.919751
##   3     3.625279  0.05164714  2.932977
##   4     3.675532  0.04818228  2.976000
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## 
## 
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## glmnet 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 132, 132, 131, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared    MAE     
##   0.10   0.0008685768  3.431123  0.04351281  2.791480
##   0.10   0.0086857683  3.429494  0.04351063  2.790610
##   0.10   0.0868576832  3.409582  0.04346430  2.778831
##   0.55   0.0008685768  3.431181  0.04351092  2.791515
##   0.55   0.0086857683  3.426229  0.04353186  2.788748
##   0.55   0.0868576832  3.368016  0.04449490  2.751427
##   1.00   0.0008685768  3.431070  0.04351271  2.791485
##   1.00   0.0086857683  3.422757  0.04353742  2.786756
##   1.00   0.0868576832  3.332271  0.05374172  2.725476
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.08685768.

Each of these ensemble models shows comparable RMSE and MAE values. The RMSE is slightly lower for the glmnet-combined stacked ensemble, though this difference may not be significant and may not hold in out-of-sample data. These three ensemble models (glm-, rf-, and glmnet-combined) will next be used to predict annual change in CDR-Sum of Boxes in the held-out test dataset. Note: since the models were trained with center and scale preprocessing specified, the test data does not need to be manually pre-processed.
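The mechanism behind that note: caret stores the training-set centering and scaling parameters inside the fitted model and re-applies them to new data at prediction time. A base-R sketch of that train-then-apply pattern (toy numbers, not the ROI features):

```r
# Toy training and test values for one feature
train.x <- c(1, 2, 3, 4, 5)
test.x  <- c(6, 7)

# Parameters are estimated on the TRAINING data only...
mu  <- mean(train.x)
sdv <- sd(train.x)

# ...and re-applied unchanged to new data, mirroring how predict()
# on a caret model applies the stored preProcess step
(test.x - mu) / sdv  # 1.897 2.530 (approximately)
```

Re-estimating the mean and standard deviation on the test set instead would leak test-set information into the transformation, which is exactly what caret's approach avoids.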

Model predictions

Training data predictions

# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
ensemble.rf.train <- predict(stacked.ensemble.rf)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
real.train <- original.train$CDRSB

# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
                                ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train, 
                                ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()

datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))

Test data predictions

# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- original.test$CDRSB  
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=original.test)
knn.test <- predict.train(ensemble.models$knn, newdata=original.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=original.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=original.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=original.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=original.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=original.test)

# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
                               ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()

datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))

Predicted vs. real CDR-SoB comparison

I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained with the individual component models to see if constructing the ensemble confers a predictive advantage. First, I will visualize how the predicted values stack up against the actual values for annual change in CDR-Sum of Boxes, in both the training and the test data:

# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df  %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900) 

To quantify the association between real and model-predicted CDR-SoB values, I will compute \(R^2\) and RMSE using the R2 and RMSE functions from caret:
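For reference, a minimal sketch of what these two caret helpers compute, using toy vectors rather than project data; note that caret's default \(R^2\) is the squared Pearson correlation between predictions and observations, not the sum-of-squares definition:

```r
# Toy observed and predicted values (hypothetical, for illustration only)
obs  <- c(1.0, 2.0, 3.0, 4.0)
pred <- c(1.1, 1.9, 3.2, 3.8)

sqrt(mean((pred - obs)^2))  # RMSE, ~0.158 -- what caret's RMSE(pred, obs) returns
cor(pred, obs)^2            # R2, ~0.982 -- what caret's R2(pred, obs) returns by default
```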

# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
                         ensemble.glm=RMSE(ensemble.glm.train, real.train),
                         ensemble.rf=RMSE(ensemble.rf.train, real.train),
                         elastic.net=RMSE(glmnet.train, real.train),
                         knn=RMSE(knn.train, real.train),
                         neural.net=RMSE(nnet.train, real.train),
                         random.forest=RMSE(rf.train, real.train),
                         Metric="Train_RMSE")

# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
                        ensemble.glm=RMSE(ensemble.glm.test, real.test),
                        ensemble.rf=RMSE(ensemble.rf.test, real.test),
                        elastic.net=RMSE(glmnet.test, real.test),
                        knn=RMSE(knn.test, real.test),
                        neural.net=RMSE(nnet.test, real.test),
                        random.forest=RMSE(rf.test, real.test),
                        Metric="Test_RMSE")

# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
                       ensemble.glm=R2(ensemble.glm.train, real.train),
                       ensemble.rf=R2(ensemble.rf.train, real.train),
                       elastic.net=R2(glmnet.train, real.train),
                       knn=R2(knn.train, real.train),
                       neural.net=R2(nnet.train, real.train),
                       random.forest=R2(rf.train, real.train),
                       Metric="Train_R2")

# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
                      ensemble.glm=R2(ensemble.glm.test, real.test),
                      ensemble.rf=R2(ensemble.rf.test, real.test),
                      elastic.net=R2(glmnet.test, real.test),
                      knn=R2(knn.test, real.test),
                      neural.net=R2(nnet.test, real.test),
                      random.forest=R2(rf.test, real.test),
                      Metric="Test_R2")

# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  mutate(Metric = str_replace(Metric, "_", " ")) %>%
  pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
  mutate_if(is.numeric, function(x) round(x,4)) %>%
  kable(., booktabs=T) %>% kable_styling(full_width=F)
Model Train RMSE Test RMSE Train R2 Test R2
ensemble.glmnet 3.2889 2.9451 0.0454 0.0091
ensemble.glm 3.6042 2.9992 0.1001 0.0037
ensemble.rf 3.5557 2.9160 0.0001 0.0518
elastic.net 3.3282 2.9905 0.0447 0.0017
knn 3.2500 2.8977 0.0652 0.0380
neural.net 2.9036 3.2080 0.2643 0.0188
random.forest 2.0228 2.9338 0.8359 0.0374

The predicted CDR-SoB change values from the individual random forest model were very similar to the actual observed values in the training data, yielding an \(R^2\) of 0.84. However, comparing this with the \(R^2\) of only 0.04 in the test dataset suggests that the model is overfit to the training data and does not generalize well beyond it. Most of the other models likewise showed better agreement between predictions and actual values in the training dataset than in the out-of-sample test dataset.

# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  # Reshape to facet on metric -- i.e. RMSE or R2
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  separate(Metric, into=c("Data", "Metric"), sep="_")

p.ensemble.r2.rmse <- overall.ensemble.results %>%
  mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
  mutate(Data=factor(Data, levels=c("Train", "Test")),
         Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  facet_wrap(Metric~., scales="free", nrow=1) +
  theme(strip.text=element_text(size=12, face="bold"),
        axis.title=element_blank())

# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>% 
  layout(yaxis = list(title = "R2", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "Data Subset", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "Data Subset", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Re-name objects to have names specific to original dataset:

# Rename relevant objects to have original data-specific names
models.for.ensemble.OG <- ensemble.models
model.metrics.OG <- overall.ensemble.results
stacked.ensemble.glm.OG <- stacked.ensemble.glm
stacked.ensemble.glmnet.OG <- stacked.ensemble.glmnet
stacked.ensemble.rf.OG <- stacked.ensemble.rf

Phase Four: Modeling (PCA Data)

Data split for train + test

As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, such that the models are trained and evaluated in independent data subsets. I will use the sample function with a specific seed set (127) to partition the data into 75% training and 25% testing.

# Set seed for consistency in random sampling for 10-fold cross-validation
set.seed(127)
train.index <- sample(nrow(post.pca), nrow(post.pca)*0.75, replace=F)

# Remove unnecessary identifier info from datasets for modeling
pca.df <- post.pca %>% ungroup() %>% select(-RID, -interval_num)

# Pre-processing will be applied in model training with caret

# Subset training + test data for PCA-transformed data
pca.train <- pca.df[train.index, ]
pca.test <- pca.df[-train.index, ]

Individual model training and tuning

I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.

The first step is to create a caretList of the four regression models I will use:

  • Elastic net regression (glmnet)
  • k-nearest neighbors regression (knn)
  • Neural network regression (nnet)
  • Random forest regression (ranger)

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.

# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)

I’m using RMSE as the metric for tuning model parameters. All predictors will be automatically centered and scaled via the preProcess option; caret estimates the centering and scaling parameters from the training data and applies that same transformation to held-out data at prediction time.

# Set seed for consistency
set.seed(127)

# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), decay = seq(0, 1, by=0.2)))

# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)

# random forest: try 15 different numbers of features considered at each node and use 500 sampled trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")

# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))

# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ ., 
                                                      data=pca.train, 
                                                      trControl=ensemble.control, 
                                                      metric="RMSE", 
                                                      preProcess=c("center", "scale"),
                                                      tuneList=list(my.neuralnet, my.knn, my.randomforest, my.elasticnet))))
## Warning in trControlCheck(x = trControl, y = target): trControl$savePredictions
## not 'all' or 'final'. Setting to 'final' so we can ensemble the models.
## Warning in trControlCheck(x = trControl, y = target): indexes not defined in
## trControl. Attempting to set them ourselves, so each model in the ensemble will
## have the same resampling indexes.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

The final chosen parameters for each model can be viewed:

Elastic net final model

# Print elastic net model summary
ensemble.models$glmnet

# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
  mutate(alpha=as.character(alpha)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=alpha)) +
  theme_minimal() +
  ggtitle("Elastic Net Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "lambda", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "lambda", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet 
## 
## 146 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE      Rsquared    MAE     
##   0.0    0.00032  3.355121  0.08891140  2.721484
##   0.0    0.00800  3.355121  0.08891140  2.721484
##   0.0    0.20000  3.380694  0.08606252  2.744232
##   0.0    1.00000  3.325818  0.07571810  2.706129
##   0.1    0.00032  3.329055  0.12064771  2.681944
##   0.1    0.00800  3.329981  0.11249497  2.681113
##   0.1    0.20000  3.393470  0.08748235  2.747140
##   0.1    1.00000  3.291948  0.10296992  2.650229
##   0.6    0.00032  3.329604  0.12057776  2.682461
##   0.6    0.00800  3.337280  0.10509960  2.686517
##   0.6    0.20000  3.301948  0.10714980  2.667047
##   0.6    1.00000  3.334947  0.04003456  2.702795
##   1.0    0.00032  3.329607  0.12054436  2.682338
##   1.0    0.00800  3.343129  0.09894212  2.691982
##   1.0    0.20000  3.295485  0.12066904  2.654881
##   1.0    1.00000  3.341064         NaN  2.706431
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 1.

Based on RMSE, lambda = 1 was chosen, which is the largest penalty value of the four assessed lambda values. Additionally, alpha = 0.1 was chosen, meaning the model is predominantly ridge regression (L2 regularization) with only a small lasso (L1) component.
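For intuition, here is a sketch of the glmnet-style elastic net penalty, in which alpha blends the ridge (L2) and lasso (L1) terms; the coefficient vector below is a toy example, not model output:

```r
# Elastic net penalty: lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1)
enet.penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1.2, 0.3)                    # hypothetical coefficient vector
enet.penalty(beta, alpha = 0,   lambda = 1)  # pure ridge penalty
enet.penalty(beta, alpha = 1,   lambda = 1)  # pure lasso penalty
enet.penalty(beta, alpha = 0.1, lambda = 1)  # the mostly-ridge blend chosen here
```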

 

kNN final model

# Print kNN model summary
ensemble.models$knn

# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("kNN Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.knn.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "k", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "k", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors 
## 
## 146 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared    MAE     
##    5  3.509062  0.05444025  2.789745
##    7  3.465840  0.06341470  2.799482
##    9  3.456892  0.05082782  2.796012
##   11  3.442187  0.05649353  2.814016
##   13  3.440380  0.05224399  2.839611
##   15  3.402068  0.05607313  2.797119
##   17  3.418300  0.04307017  2.817122
##   19  3.400647  0.02409538  2.807399
##   21  3.412439  0.02873834  2.809120
##   23  3.401650  0.04664929  2.805127
##   25  3.386162  0.06578378  2.776060
##   27  3.393384  0.07632613  2.779205
##   29  3.382616  0.08482132  2.780261
##   31  3.374153  0.09790613  2.752245
##   33  3.368038  0.10138085  2.737627
##   35  3.373759  0.07538878  2.740293
##   37  3.382404  0.07343515  2.726657
##   39  3.356678  0.09344570  2.712225
##   41  3.354263  0.08906445  2.712796
##   43  3.366859  0.08450353  2.726240
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 41.

k = 41 was chosen to minimize the RMSE. The MAE also tends to decrease at the larger values of k assessed.
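Conceptually, kNN regression predicts each new point as the average outcome of its k nearest training neighbors. A minimal base-R sketch with a hypothetical helper (knn.predict.one, not part of caret) and toy data:

```r
# Predict one new observation as the mean outcome of its k nearest neighbors
knn.predict.one <- function(train.x, train.y, new.x, k) {
  dists <- sqrt(colSums((t(train.x) - new.x)^2))  # Euclidean distance to each training row
  mean(train.y[order(dists)[1:k]])                # average the k closest outcomes
}

# Toy data: outcome is exactly 2*x, so the prediction at x = 3 with k = 3
# averages the outcomes at x = 2, 3, and 4
train.x <- matrix(1:10, ncol = 1)
train.y <- 2 * (1:10)
knn.predict.one(train.x, train.y, new.x = 3, k = 3)  # 6
```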

 

Neural network final model

# Print neural network model summary
ensemble.models$nnet

# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
  mutate(decay=as.character(decay)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=decay)) +
  theme_minimal() +
  ggtitle("Neural Network Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.nnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Neurons in Hidden Layer", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Neurons in Hidden Layer", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network 
## 
## 146 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared    MAE     
##    1    0.0    3.491381  0.10008636  2.824163
##    1    0.2    3.336196  0.09030736  2.732377
##    1    0.4    3.329783  0.08839556  2.722189
##    1    0.6    3.340703  0.08270915  2.726425
##    1    0.8    3.345404  0.08207865  2.739676
##    1    1.0    3.300267  0.10599575  2.696108
##    3    0.0    3.796960  0.11847520  3.090263
##    3    0.2    3.782949  0.05624117  3.038644
##    3    0.4    3.760351  0.05854543  3.067227
##    3    0.6    3.586505  0.03282787  2.967926
##    3    0.8    3.539465  0.03551497  2.875267
##    3    1.0    3.448567  0.07059390  2.800562
##    5    0.0    4.023149  0.10885150  3.250783
##    5    0.2    3.961270  0.07869687  3.161647
##    5    0.4    3.730705  0.06369378  2.991082
##    5    0.6    3.791306  0.04368810  3.022931
##    5    0.8    3.705186  0.03986465  3.011960
##    5    1.0    3.614885  0.05028242  2.943254
##   10    0.0    4.860068  0.11476593  3.751501
##   10    0.2    4.292513  0.09076980  3.396573
##   10    0.4    4.046137  0.05523522  3.206090
##   10    0.6    4.115262  0.04132402  3.220741
##   10    0.8    3.850676  0.05429766  3.073934
##   10    1.0    3.710847  0.05529677  3.009365
##   15    0.0    6.193071  0.09494992  4.815781
##   15    0.2    4.534828  0.05005003  3.726777
##   15    0.4    4.233271  0.03194604  3.337337
##   15    0.6    3.921494  0.03571178  3.156730
##   15    0.8    3.868184  0.04262218  3.119722
##   15    1.0    3.735623  0.05592781  3.002910
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 1.

The optimal neural network based on RMSE includes one neuron in the hidden layer, which receives input from all seven input nodes (the five principal components, baseline age, and sex) and outputs onto the final prediction node. The decay weight was chosen as 1. Here’s a graphical representation of this caret-trained neural network:

par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
        circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)

Age at baseline, PC1, PC5, and sex (male) all exert negative weights onto the hidden layer, while PC2, PC3, and PC4 exert positive weights.

Random forest final model

# Print random forest model summary
ensemble.models$ranger

# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=splitrule)) +
  theme_minimal() +
  ggtitle("Random Forest Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.rf.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Predictors in Decision Node", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Predictors in Decision Node", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest 
## 
## 146 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 131, 132, 133, 131, 130, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared    MAE     
##   2     variance    3.435190  0.05106820  2.752893
##   2     extratrees  3.393176  0.07757258  2.746528
##   3     variance    3.434558  0.05365288  2.743423
##   3     extratrees  3.396129  0.07558109  2.739605
##   4     variance    3.468453  0.06436310  2.778843
##   4     extratrees  3.423763  0.07165595  2.757805
##   5     variance    3.457977  0.06641000  2.772288
##   5     extratrees  3.418851  0.07026874  2.752888
##   6     variance    3.488285  0.06138858  2.790515
##   6     extratrees  3.433786  0.06495634  2.772185
##   7     variance    3.499823  0.05927005  2.795237
##   7     extratrees  3.452706  0.06952847  2.787549
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 2, splitrule = extratrees
##  and min.node.size = 5.

An mtry of 2 with the extratrees split rule yielded the lowest RMSE, meaning two randomly sampled predictor terms are considered at each decision node. The lowest MAE was achieved by mtry = 3 with extratrees, though only by a narrow margin.
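To illustrate what mtry controls: at each decision node, the forest draws a random subset of mtry predictors and splits on the best one within that subset. A sketch of the sampling step (the predictor names below are placeholders, not necessarily the dataset's actual column names):

```r
# Hypothetical predictor names standing in for the PCA dataset's columns
predictors <- c("PC1", "PC2", "PC3", "PC4", "PC5", "age", "sex")

# With mtry = 2, each node considers a random pair of predictors
set.seed(127)
sample(predictors, size = 2)  # candidate predictors at one node
```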

Individual model performance

The cross-validation results for these four models can be collected and summarized using the resamples function:

# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
## 
## Call:
## summary.resamples(object = ensemble.results)
## 
## Models: nnet, knn, ranger, glmnet 
## Number of resamples: 10 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   2.067302 2.586122 2.709901 2.696108 2.754605 3.424120    0
## knn    2.123687 2.674208 2.819514 2.712796 2.869167 2.982824    0
## ranger 2.250988 2.642377 2.754892 2.746528 2.881790 3.387265    0
## glmnet 1.913407 2.597008 2.738139 2.650229 2.892360 2.945955    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   2.644517 2.979506 3.320504 3.300267 3.502243 3.986928    0
## knn    2.767943 3.184711 3.484487 3.354263 3.578189 3.715429    0
## ranger 2.851123 3.168851 3.360236 3.393176 3.754003 3.871824    0
## glmnet 2.561769 3.091740 3.424252 3.291948 3.523890 3.732111    0
## 
## Rsquared 
##                Min.     1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## nnet   0.0215908469 0.075432599 0.09478041 0.10599575 0.14579067 0.1843352    0
## knn    0.0001018331 0.018339397 0.06719040 0.08906445 0.09918307 0.3866769    0
## ranger 0.0006989116 0.009395575 0.02792573 0.07757258 0.15564516 0.2138979    0
## glmnet 0.0009809787 0.037518800 0.10055404 0.10296992 0.15256215 0.2265591    0

A NaN \(R^2\) for a glmnet fit (as seen for alpha = 1, lambda = 1 in the elastic net tuning grid above) indicates that the model outputs a constant prediction, likely because every coefficient was shrunk to zero, leaving only the intercept. Across resamples, the elastic net model has the lowest mean MAE and RMSE of the four models, though the margins over the neural network are slim.
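A quick sketch of why constant predictions break the correlation-based \(R^2\) (toy vectors, for illustration):

```r
obs  <- c(1, 2, 3, 4)
pred <- rep(2.5, 4)   # intercept-only model: identical prediction for every row
cor(pred, obs)^2      # NA, with a zero-standard-deviation warning --
                      # correlation is undefined when predictions have no variance
```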

It’s also useful to look at the regression correlation between these component models:

cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
## 
## Root-Mean Square Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.8585108 0.9438624 0.9038053
## knn    0.8585108 1.0000000 0.7950479 0.9654056
## ranger 0.9438624 0.7950479 1.0000000 0.8465626
## glmnet 0.9038053 0.9654056 0.8465626 1.0000000
## 
## 
## Mean Absolute Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.8592234 0.8683851 0.8419390
## knn    0.8592234 1.0000000 0.6901874 0.9646991
## ranger 0.8683851 0.6901874 1.0000000 0.7394872
## glmnet 0.8419390 0.9646991 0.7394872 1.0000000

The models are highly correlated in terms of error (RMSE correlations range from roughly 0.80 to 0.97). This is not ideal, since stacked ensembles benefit most when component models make uncorrelated errors that can offset one another. The error can be visualized with confidence intervals using the dotplot function:

# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE")) 
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)

The models all show very similar RMSE confidence intervals. The elastic net model shows the lowest mean MAE of the four models, while the random forest shows the highest. Despite the large error correlation between the models, I’ll move forward to see if ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.

Model ensemble training and tuning

Generalized linear ensemble

Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:
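Conceptually, this GLM stack just regresses the true outcome on the component models' cross-validated predictions, and the fitted coefficients become the ensemble weights. A minimal sketch with simulated component predictions (caretEnsemble handles this bookkeeping internally; the data here are made up):

```r
set.seed(1)
actual <- rnorm(50)

# Simulated component-model predictions: each tracks the outcome with noise
component.preds <- data.frame(
  nnet   = actual + rnorm(50, sd = 0.5),
  knn    = actual + rnorm(50, sd = 0.7),
  glmnet = actual + rnorm(50, sd = 0.6)
)

# The meta-model: a linear combination of the component predictions
stack.fit <- glm(actual ~ ., data = cbind(actual = actual, component.preds))
coef(stack.fit)  # intercept plus one weight per component model
```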

set.seed(127)
# Set trainControl --> repeated 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)

# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet 
## They were weighted: 
## 0.5657 0.7581 -1.485 -0.4368 0.7813
## The resulting RMSE is: 3.3348
## The fit for each individual model on the RMSE is: 
##  method     RMSE    RMSESD
##    nnet 3.300267 0.4272024
##     knn 3.354263 0.3258392
##  ranger 3.393176 0.3823175
##  glmnet 3.291948 0.3601309

All the models show similar RMSE values, but the elastic net offers the lowest by a slim margin. The caretEnsemble package provides an autoplot method to create in-depth diagnostic plots for ensemble models:

autoplot(stacked.ensemble.glm)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.

## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
## Warning: Duplicated aesthetics after name standardisation: size
## Warning in predict.caretList(object$models, ...): Predicting without new data is
## not well supported. Attempting to predict on the training data.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The top left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The red dashed line shows the RMSE of the ensemble model, which is slightly lower than that of any of the individual models. The middle-left plot shows the fitted ensemble weights: consistent with the printed coefficients, a large negative weight for the kNN model, a smaller negative weight for the random forest, and positive weights for the neural network, elastic net, and intercept.

Stacked ensembles

Stacked ensembles can also be constructed using caretStack, which trains a second-level caret model on the predictions of the constituent models. I’ll try one using a random forest meta-model and one using an elastic net meta-model.

set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)

The summary statistics for each model can be displayed:

cat("\nStacked ensemble, generalized linear model:\n") 
stacked.ensemble.glm 
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
## 
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 130, 132, 132, 130, 133, 131, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.334846  0.1347023  2.716977
## 
## 
## 
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Random Forest 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 130, 132, 132, 130, 133, 131, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     3.470241  0.1194692  2.839822
##   3     3.481437  0.1016722  2.841097
##   4     3.496424  0.1094410  2.841884
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## 
## 
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## glmnet 
## 
## 146 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 133, 131, 132, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE      Rsquared    MAE     
##   0.10   0.001092043  3.307443  0.11030726  2.719166
##   0.10   0.010920427  3.307277  0.11028892  2.719181
##   0.10   0.109204267  3.303952  0.10892320  2.720701
##   0.55   0.001092043  3.307608  0.11027017  2.719368
##   0.55   0.010920427  3.307895  0.10958317  2.721167
##   0.55   0.109204267  3.310901  0.09966416  2.735756
##   1.00   0.001092043  3.307675  0.11025333  2.719438
##   1.00   0.010920427  3.308564  0.10885942  2.723193
##   1.00   0.109204267  3.311359  0.09829005  2.738177
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1092043.

The elastic net-stacked ensemble model shows the lowest cross-validated RMSE and MAE of the three ensembles, though this difference may not be significant and may not translate to the out-of-sample data. These three ensemble models (glm-, rf-, and glmnet-combined) will be used to predict annual change in CDR-Sum of Boxes in the same test dataset. Note: since the models were created with center and scale preprocessing specified, the test data does not need to be manually pre-processed.
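As a reminder of why no manual pre-processing is needed: the preProcess step stored in each caret model standardizes every predictor using the training-set mean and standard deviation, and predict.train applies that same stored transformation to any new data automatically:

\[ x' = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}} \]

so passing the raw test data to predict yields inputs on the same scale the models were trained on.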

Model predictions

Training data predictions

# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
ensemble.rf.train <- predict(stacked.ensemble.rf)
## Warning in predict.caretList(object$models, newdata = newdata, na.action =
## na.action): Predicting without new data is not well supported. Attempting to
## predict on the training data.
real.train <- pca.train$CDRSB

# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
                                ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train, 
                                ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()

datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))

The elastic net individual model clearly predicts the same value (0.3539) for all training cases, indicating it will not be of much use to predict the CDR Sum of Boxes change.

Test data predictions

# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- pca.test$CDRSB  
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=pca.test)
knn.test <- predict.train(ensemble.models$knn, newdata=pca.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=pca.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=pca.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=pca.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=pca.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=pca.test)

# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
                               ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()

datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))

Predicted vs. real CDR-SoB comparison

I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained from the individual component models to see whether constructing the ensemble confers a predictive advantage. First, I will visualize how the predicted values stack up to the actual values for annual change in CDR-Sum of Boxes, in both the training and the test data:

# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df  %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900) 

To quantify the association between real and model-predicted CDR-SoB values, I will use \(R^2\) and RMSE via the R2 and RMSE functions from caret:
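For reference, RMSE is the root mean squared prediction error, and caret’s R2 function with its default settings returns the squared Pearson correlation between predictions and observations:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad R^2 = \operatorname{cor}(\hat{y}, y)^2 \]

Because this \(R^2\) is correlation-based, it can sit near zero even when the RMSE looks moderate, which is worth keeping in mind when comparing the two metrics.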

# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
                         ensemble.glm=RMSE(ensemble.glm.train, real.train),
                         ensemble.rf=RMSE(ensemble.rf.train, real.train),
                         elastic.net=RMSE(glmnet.train, real.train),
                         knn=RMSE(knn.train, real.train),
                         neural.net=RMSE(nnet.train, real.train),
                         random.forest=RMSE(rf.train, real.train),
                         Metric="Train_RMSE")

# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
                        ensemble.glm=RMSE(ensemble.glm.test, real.test),
                        ensemble.rf=RMSE(ensemble.rf.test, real.test),
                        elastic.net=RMSE(glmnet.test, real.test),
                        knn=RMSE(knn.test, real.test),
                        neural.net=RMSE(nnet.test, real.test),
                        random.forest=RMSE(rf.test, real.test),
                        Metric="Test_RMSE")

# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
                       ensemble.glm=R2(ensemble.glm.train, real.train),
                       ensemble.rf=R2(ensemble.rf.train, real.train),
                       elastic.net=R2(glmnet.train, real.train),
                       knn=R2(knn.train, real.train),
                       neural.net=R2(nnet.train, real.train),
                       random.forest=R2(rf.train, real.train),
                       Metric="Train_R2")

# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
                      ensemble.glm=R2(ensemble.glm.test, real.test),
                      ensemble.rf=R2(ensemble.rf.test, real.test),
                      elastic.net=R2(glmnet.test, real.test),
                      knn=R2(knn.test, real.test),
                      neural.net=R2(nnet.test, real.test),
                      random.forest=R2(rf.test, real.test),
                      Metric="Test_R2")

# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  mutate(Metric = str_replace(Metric, "_", " ")) %>%
  pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
  mutate_if(is.numeric, function(x) round(x,4)) %>%
  kable(., booktabs=T) %>% kable_styling(full_width=F)
Model             Train RMSE   Test RMSE   Train R2   Test R2
ensemble.glmnet       3.6721      3.2232     0.1763    0.0009
ensemble.glm          3.7837      3.2671     0.2501    0.0014
ensemble.rf           3.8638      3.2445     0.0645    0.0011
elastic.net           3.2558      3.4432     0.0635    0.0017
knn                   3.2878      2.9774     0.0516    0.0001
neural.net            3.1738      3.0410     0.1022    0.0141
random.forest         2.0513      2.8974     0.8569    0.0458

The random forest model shows the lowest RMSE between predicted vs. real CDR-Sum of Boxes for both the training and test data. It also has the highest \(R^2\) value between predicted and real CDR-Sum of Boxes rate of change. However, the training \(R^2\) of 0.8569 is substantially larger than the test \(R^2\) of 0.0458, suggesting major overfitting.

In the training dataset, the random forest predictions track the actual values fairly closely, with the glm- and elastic net-stacked ensemble models showing nearly comparable correlations between real vs. predicted values. However, this relationship is not evident in the testing data. Aside from the elastic net individual model, the individual and ensemble models all tend to over-estimate CDR-Sum of Boxes change for test cases with zero actual change. Beyond zero, however, the models tend to underestimate CDR-Sum of Boxes change, particularly for values of ~1.5 and above.

# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  # Reshape to facet on metric -- i.e. RMSE or R2
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  separate(Metric, into=c("Data", "Metric"), sep="_")

p.ensemble.r2.rmse <- overall.ensemble.results %>%
  mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
  mutate(Data=factor(Data, levels=c("Train", "Test")),
         Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  facet_wrap(Metric~., scales="free", nrow=1) +
  theme(strip.text=element_text(size=12, face="bold"),
        axis.title=element_blank())

# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>% 
  layout(yaxis = list(title = "R2", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "Data Subset", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "Data Subset", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

There are two clusters of model performance here: the random forest, elastic net-stacked, and glm-stacked ensemble models show high \(R^2\) in the training data, and all decrease to a very similar, much lower value in the test dataset. All three show the lowest RMSE as well. By contrast, the neural network, kNN, and random forest-stacked ensemble models show low \(R^2\) and high RMSE. Interestingly, their error is higher in the training dataset.

Re-name objects to have names specific to the PCA-transformed dataset:

# Rename relevant objects to have PCA-transformed data-specific names
models.for.ensemble.PCA <- ensemble.models
model.metrics.PCA <- overall.ensemble.results
stacked.ensemble.glm.PCA <- stacked.ensemble.glm
stacked.ensemble.glmnet.PCA <- stacked.ensemble.glmnet
stacked.ensemble.rf.PCA <- stacked.ensemble.rf

Phase Five: Model Evaluation

I’ll evaluate model performance across the two ROI aggregation types:

  • Original data
  • PCA-transformed data

Here, I will define “model performance” as the RMSE and R\(^2\) for agreement between predicted vs. actual CDR-Sum of Boxes change per year in the test dataset. Seven models are evaluated: four individual models and three stacked ensembles.

model.metrics.OG$ROI_Type <- "Original"
model.metrics.PCA$ROI_Type <- "PCA"

# Concatenate the four dataframes into one dataframe
model.metrics.full <- do.call(plyr::rbind.fill, list(model.metrics.OG,
                                                     model.metrics.PCA))

Predicted vs. real CDR-SoB comparison

I’ll use bar plots to visualize the RMSE and R\(^2\) metrics for each of the seven models:

# Define two distinct color palettes using hex colors
rmse.pal <- c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C")
r2.pal <- c("#fcb9b2", "#f67e7d", "#843b62", "#621940")

# Create static ggplot
p.rmse <- model.metrics.full %>%
  # Plot the RMSE for test data predictions
  filter(Data=="Test", Metric=="RMSE") %>%
  # Show models in specific order on x-axis
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("RMSE Between Predicted vs. Actual CDR-SoB") +
  # Use pre-defined color palette
  scale_fill_manual(values=rmse.pal) +
  ylab("RMSE") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")

# Convert to plotly interactive visualization
ggplotly(p.rmse, width=700, height=300)
p.r2 <- model.metrics.full %>%
  # Plot the R2 for test data predictions
  filter(Data=="Test", Metric=="R2") %>%
  # Show models in specific order on x-axis
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("R2 Between Predicted vs. Actual CDR-SoB") +
  # Use pre-defined color palette
  scale_fill_manual(values=r2.pal) +
  ylab("R2") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")
ggplotly(p.r2, width=700, height=300)

The RMSE doesn’t afford much distinction either by ROI averaging type or by predictive model. If anything, the RMSE stands out as somewhat larger for the original ROI configuration in the random forest-stacked ensemble (ensemble.rf) and the neural network (neural.net). However, the R\(^2\) agreement between predicted vs. actual CDR-SoB does differentiate among the models. The random forest model (random.forest) actually shows the highest R\(^2\) value for each of the two ROI configurations, surpassing all three of the stacked ensemble models. The PCA-transformed data shows the larger R\(^2\) agreement of the two ROI configurations, followed closely by the original data configuration; however, even these values are <0.2, suggesting minimal association between the predictions and the actual values.

Interpretation

At the end of the day, none of the models and corresponding ROI configurations offered strong predictions for CDR-Sum of Boxes annual change based on regional changes in tau-PET tracer uptake. To improve this model in the future, I would consider including the baseline tau-PET SUVR value in a mixed-effects model; conceivably, an increase of 0.5 SUVR could have different implications for a subject with no tau-PET uptake at baseline versus a subject with high tau-PET uptake at baseline. It’s also entirely possible that there is a temporal lag between tau-PET uptake and cognitive decline; for example, tau accumulation in a given ROI (e.g. hippocampus) may precede cognitive decline by months or years, which would not be detected in this model. This is a complex pathophysiological dynamic that necessitates sophisticated modeling supported by extensive domain knowledge.
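As a sketch of that proposed extension (a hypothetical model, not fit here), a mixed-effects formulation could model each subject’s annual CDR-SoB change as a function of both the SUVR change and the baseline SUVR, with a random intercept per subject:

\[ \Delta \text{CDRSB}_{i} = \beta_0 + \beta_1\,\Delta\text{SUVR}_{i} + \beta_2\,\text{SUVR}_{i}^{\text{baseline}} + \beta_3\left(\Delta\text{SUVR}_{i} \times \text{SUVR}_{i}^{\text{baseline}}\right) + u_{\text{subject}(i)} + \varepsilon_i \]

In lme4 syntax this would look something like `CDRSB ~ SUVR_change * SUVR_baseline + Age + Sex + (1 | Subject)`, where the interaction term captures the idea that the same SUVR increase may carry different weight depending on baseline burden.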

That being said, I am interested in a yet-unexamined facet of this project: how does a given region of interest influence a given model? How important is e.g. the hippocampus relative to e.g. the insula?

Variable contributions

To answer these questions, I will use the varImp function from caret, which computes the relative importance of each variable used as model input. Since variable importance is not readily identifiable for caretStack ensembles, I’ll focus exclusively on individual models. Variable importance doesn’t really apply for k-Nearest Neighbors (kNN), so I will examine the random forest, elastic net, and neural network regression models. For elastic net regression, I will also extract ROI coefficients.

First, I’ll examine the original ROI configuration:

# Extract variable importance from random forest regression model
rf.var.imp <- varImp(models.for.ensemble.OG$ranger)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Random Forest")

# Extract variable importance from elastic net regression model
glmnet.var.imp <- varImp(models.for.ensemble.OG$glmnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Elastic Net")

# Extract variable coefficients from optimal elastic net regression model
glmnet.var.coef <- as.data.frame(as.matrix(coef(models.for.ensemble.OG$glmnet$finalModel, models.for.ensemble.OG$glmnet$bestTune$lambda))) %>%
  rownames_to_column(var="Term") %>% dplyr::rename("Coefficient"="1") %>%
  mutate(Model="Elastic Net")

# Combine importance + coefficients for elastic net regression dataframe
glmnet.var <- left_join(glmnet.var.imp, glmnet.var.coef)
## Joining, by = c("Term", "Model")
# Extract variable importance from neural network regression model
nnet.var.imp <- varImp(models.for.ensemble.OG$nnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Neural Network")

# Concatenate these three dataframes into one dataframe
og.importance <- do.call(plyr::rbind.fill, list(rf.var.imp, glmnet.var, nnet.var.imp))
datatable(og.importance)
# Pre-define color palette
my.pal <- colorRampPalette(c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C"))(200)

# Plot the variable importance for each model
p.importance <- og.importance %>%
  ggplot(data=., mapping=aes(x=Term, y=Importance, fill=Importance)) +
  geom_bar(stat="identity") +
  # Facet by model, one model per row
  facet_wrap(Model ~ ., scales="free_y", nrow=3) +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  # Rotate x-axis text for legibility
  theme(axis.text.x = element_text(angle=90, hjust=1),
        axis.title=element_blank(),
        strip.text = element_text(size=14))

# Convert to interactive plotly visualization
ggplotly(p.importance, width=800, height=700) %>%
  layout(xaxis = list(title = "Model Term", 
                      titlefont = list(size = 14)), 
         yaxis=list(title="Importance", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Importance", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Importance", 
                       titlefont = list(size = 12)))

It’s really interesting how differently these three models weight each model term. The elastic net model placed some of the largest importance on the bankssts, entorhinal, hippocampus, and inferiortemporal ROIs, which typically bear some of the heaviest tau pathology burdens in the human Alzheimer’s Disease brain. The neural network found sex to be the most important predictive variable by far, which is surprising and may contribute to the relatively low predictive accuracy associated with the neural network. The random forest identified the inferiorparietal, isthmuscingulate, insula, and parahippocampal ROIs as the most important predictor terms.

Variable coefficients

I’m curious as to how variable importance relates to the regularized coefficient calculated for each predictor term in the elastic net model:

# Pre-define color palette
my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)

# Plot variable coefficients and fill with variable importance
p.enet <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Rearrange term by coefficient using forcats::fct_reorder
  ggplot(data=., mapping=aes(x=fct_reorder(Term, Coefficient), y=Coefficient, 
                             fill=Importance, label=Term)) +
  geom_bar(stat="identity") +
  xlab("Term") +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  ggtitle("Variable Coefficients & Importance in Elastic Net Regression") +
  theme(axis.text.x=element_text(angle=90),
        plot.title=element_text(hjust=0.5))

# tooltip specifies only to show the label (term), y (coefficient), and fill (importance) upon cursor hover
ggplotly(p.enet, tooltip=c("label", "y", "fill"), width=700, height=400)

The most important features bookend the x-axis and also have the largest-magnitude coefficients. It’s pretty surprising that the entorhinal cortex and hippocampus have two of the most negative coefficients of all the variables; a negative coefficient means that the rate of tau accumulation in that ROI is negatively associated with the CDR-Sum of Boxes score increase. The entorhinal cortex and hippocampus are two of the first gray matter regions to develop tau pathology in Alzheimer’s Disease according to the Braak staging paradigm, so I would expect increased tau accumulation in these regions to be associated with increased CDR-Sum of Boxes scores, not decreased (note: a higher CDR-Sum of Boxes score means greater cognitive impairment). One possible interpretation relates to the temporal gap between tau neurofibrillary tangle accumulation and cognitive decline described above, though further assessment would be warranted to interpret the implications of these negative coefficients.

Regional importance and coefficients

I like to use the ggseg package to visualize these elastic net regression coefficients and variable importance metrics within the brain space:

# Read in the aparc (cortical parcellation) ROIs
ggseg.aparc <- read_excel("RData/ggseg_roi.xlsx", sheet=1) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

# Read in the aseg (subcortical segmentation) ROIs
ggseg.aseg <- read_excel("RData/ggseg_roi.xlsx", sheet=2) %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

# Pre-defined color palette
my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)

# Plot ROI importance in brain using ggseg
p.roi.imp <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Remove the non-brain ROI terms
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  # Convert to brain representation using Desikan-Killiany (dk) atlas
  ggseg(atlas="dk", mapping=aes(fill=Importance, label=region)) +
  ggtitle("Elastic Net ROI Importance") +
  # Label left and right hemispheres
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_gradientn(colors=my.pal) +
  # Use calibri font
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))
## Warning: Ignoring unknown aesthetics: label
# Convert to interactive plotly visualization
ggplotly(p.roi.imp, dynamicTicks = T, tooltip=c("label", "fill"), width=700, height=300)

One interesting observation is that the ROIs with the largest relative importance are located at the lateral edges of the cortex rather than the medial edge, with the exception of the entorhinal and pericalcarine cortices.

The elastic net coefficient weights can also be viewed in the brain:

# Plot ROI coefficient in brain
p.roi.coef <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Remove the non-brain ROI terms
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  # Convert to brain representation using Desikan-Killiany (dk) atlas
  ggseg(atlas="dk", mapping=aes(fill=Coefficient, label=region)) +
  ggtitle("Elastic Net ROI Coefficients") +
  # Label left and right hemispheres
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_continuous_divergingx(palette = 'RdBu', rev=T, mid = 0) +
  # Use calibri font
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))
## Warning: Ignoring unknown aesthetics: label
# Convert to interactive plotly visualization
ggplotly(p.roi.coef, dynamicTicks = T, tooltip=c("label", "fill"), width=700, height=300)

Phase Six: Deployment

Data reshaping

Combine original train + original test into full df:

original.full <- plyr::rbind.fill(original.train, original.test)

Pivot the original ROI configuration dataset from wide to long (preparation for correlation matrix), omitting age at baseline and sex:

# Pivot wide --> long
original.roi.long <- original.full %>%
  select(-Age_Baseline, -Sex_Male) %>%
  pivot_longer(cols=c(-CDRSB), names_to="ROI", values_to="SUVR_Change")

Pairwise correlations

Calculate pairwise ROI tau-PET SUVR correlations using rcorr function from the Hmisc package:

# Convert tau-PET regional SUVR change to a matrix
original.roi.mat <- original.full %>%
  select(-Age_Baseline, -Sex_Male, -CDRSB) %>%
  as.matrix()

# Use rcorr from Hmisc to calculate Pearson correlation coefficients ($r)
original.roi.corr <- rcorr(original.roi.mat)
original.roi.corr.coef <- original.roi.corr$r

For visualization, it’s easier to convert this correlation matrix from wide to long. I’ll use the pivot_longer function from the tidyr package to reshape the correlation coefficient dataframe.

# Correlation coefficient (Pearson)
original.roi.corr.coef.long <- as.data.frame(original.roi.corr.coef) %>%
  # Cast rownames as a dataframe column
  rownames_to_column(var="ROI1") %>%
  # Pivot data wide --> long
  pivot_longer(cols=c(-ROI1), names_to="ROI2", values_to="Pearson_Corr") %>%
  # Omit correlations between the same ROI, which are always equal to 1
  filter(ROI1 != ROI2) %>%
  # Omit instances where the two ROIs are the same, just in different columns
  mutate(ROI12 = ifelse(ROI1<ROI2, paste(ROI1, ROI2, sep="_"),
                        paste(ROI2, ROI1, sep="_"))) %>%
  distinct(ROI12, .keep_all=T) %>%
  select(-ROI12)


# Save the final long-format dataframe
original.roi.corr.results <- original.roi.corr.coef.long

iGraph preparation

# Edges are defined as cortical lobe --> specific ROI connection
edges <- read.csv("6_Deployment/tau_roi_nodes.csv") %>% distinct()
# ROIs don't include the origin --> cortical lobe connection
rois <- edges %>% filter(!(to %in% c("Cingulate", "Frontal", "Insula",
                                     "Occipital", "Parietal", "Temporal")))

# Create a dataframe of vertices, one line per object in the ROI cortical lobe hierarchy
vertices = data.frame(name = unique(c(as.character(edges$from), as.character(edges$to))))
vertices$group <- edges$from[match(vertices$name, edges$to)]

# Create an igraph object
mygraph <- graph_from_data_frame(edges, vertices=vertices)

Save data

Save this data to an .RData file to be loaded into the Shiny app:

save(original.roi.corr.coef, og.importance, original.roi.corr.results, ggseg.aparc, ggseg.aseg, rois, vertices, mygraph, file="RData/shiny_data.RData")